ABSTRACT


The CORDIC algorithm is an accurate way to compute the value of a function 
like sin(x), for a given value of x. However, it is iterative and slow. In this thesis, we 
show that a wide class of arithmetic functions can be realized on the SRC-6, a 
reconfigurable computer, using polynomial approximations. The function is realized by 
partitioning its domain into segments and then approximating the function in each 
segment by a quadratic polynomial. This is not an iterative approach, and so it is faster
than the CORDIC algorithm.


Two approximation methods are implemented. In one method, non-uniform 
segments are used. Here, larger segments can be used where the function is close to 
quadratic, while highly non-quadratic regions require smaller segments. This approach 
minimizes the number of segments. In the other method, uniform segments are used. 
Although more segments are needed than in the non-uniform method, the circuit is
simpler.


We show that accuracies of up to 33 bits are possible. A pipelined circuit was 
built on the SRC-6 in two's complement and floating point. We also show an efficient
algorithm for segmenting the function, which is faster than previous methods.
EXECUTIVE SUMMARY


This thesis focuses on the high-speed implementation of arithmetic functions,
such as sin(πx), ln(x) and 2^x. Meteorological computations, scientific calculations and
graphics are applications that require fast mathematical computation.


The CORDIC algorithm and Taylor series expansion are methods used to
compute trigonometric functions. The CORDIC algorithm is hardware efficient and precise,
but it is iterative in design and therefore slow.


In this thesis, we investigate a way to speed up mathematical computations by 
using piecewise quadratic approximations built on reconfigurable hardware. The 
function is realized by partitioning its domain into segments and then approximating the 
function in each segment by a quadratic polynomial. This is not an iterative approach,
and so it is faster than the CORDIC algorithm.


The reconfigurable hardware used is the SRC-6E, which is designed by SRC
Computers in Colorado Springs, Colorado.

The objectives were to:

• Find an efficient algorithm to segment any numeric function using
  piecewise quadratic approximations.
• Find an accurate segmentation (accurate when evaluated using the
  approximation polynomial) to any numeric function given an accuracy
  constraint in terms of number of bits.
• Design pipelined hardware for the Numeric Function Generator (NFG)
  with a small pipeline depth (compared to what is currently available).
• Design the NFG to operate at 100 MHz or faster on the FPGA.


Segmentation is a preliminary step that produces a memory file containing the
number of segments for the numeric function and, for each segment, the coefficients
needed to compute the approximation polynomial.
MATLAB is used to segment any function over a defined interval. The
MATLAB program needs to know the function, the interval, the desired accuracy and the
number of discrete points in the interval. The MATLAB built-in function Polyfit was
initially used to compute the coefficients of the approximation polynomial. Polyfit is
computationally fast, but analysis showed that the approximation it computes results in
an inefficiently segmented function.


The Remez algorithm is used to efficiently segment the numeric function. The
Remez algorithm evenly distributes the approximation error on each segment, but is
computationally intensive and slow. Several methods were investigated to speed up the
algorithm. The best method to speed up the program involved a hybrid of three methods:

• Segment width estimation that requires the third derivative of the numeric
  function and the accuracy desired by the user.
• A search algorithm similar to a binary search.
• Single stepping through points and testing to determine if the accuracy has
  been met.


The program computes an estimated segment width, and a metric is used to
determine the quality of the estimation. If the metric indicates the estimation quality is
poor, then the program will use the search algorithm to get closer to the optimum width.
In the final step, the program single steps through the points and tests each approximation
to determine when the accuracy has been met. When the segmentation of the function is
complete, the optimum segment width and the associated coefficients are saved in a
memory file for use in the NFG.
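
The width estimation step can be illustrated with a short sketch. This is not the thesis program: it assumes the standard bound that a minimax quadratic on a segment of width h has an error of about max|f'''|·h³/192, and the name `estimate_width` is hypothetical.

```python
def estimate_width(f3_max, eps):
    """Estimate the widest segment on which a quadratic can meet eps.

    Assumes the minimax quadratic error on a width-h segment is about
    f3_max * h**3 / 192, where f3_max bounds |f'''| on the segment.
    """
    if f3_max <= 0:
        raise ValueError("f3_max must be positive")
    return (192.0 * eps / f3_max) ** (1.0 / 3.0)


# f(x) = x**3 has f''' = 6; its best quadratic on [-1, 1] (width 2) has error 1/4,
# so the estimate for eps = 1/4 should come out at a width of about 2.
width = estimate_width(6.0, 0.25)
```

Because the bound uses only a local value of the third derivative, such an estimate is a starting point that the search and single-step stages then refine.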


The segmentation algorithm sped up the program tremendously. If the domain is
divided into over a million points, the original program would take at least one million
tests to segment a function. In each test, the program computes the coefficients and tests
the polynomial against the numeric function to see if the accuracy is met. When the
speed-up algorithm is used, the program requires much less than 0.1% of the number of
tests needed without the speed-up. Table 1 shows the results when 15 functions were tested.
The interval is shown in the second column, the speed-up is shown in percentage format
in the third column and the last column shows the number of segments. The percentage
is computed as:

$$\%\ \text{of tests} = \frac{\#\ \text{of tests} \times 100}{1{,}000{,}000}$$

Epsilon = 0.0000000596 = 2^-24.0, N = 1000000

Function            Interval            % of tests   # of Segments
2^x                 [0,1]               0.00910      35
1/x                 [1,2]               0.01020      50
sqrt(x)             [1,2]               0.00750      24
1/sqrt(x)           [1,2]               0.00720      36
log2(x)             [1,2]               0.00900      44
log(x)              [1,2]               0.00780      39
sin(pi*x)           [0,1/2]             0.01990      58
cos(pi*x)           [0,1/2]             0.01740      58
tan(pi*x)           [0,1/4]             0.01240      58
sqrt(-log(x...      [1/512,1/4]         0.04070      163
tan(pi*x).^...      [0,1/4]             0.02180      79
-(x*log2(x)...      [1/256,1-1/256]     0.04710      183
1/(1+exp(-x...      [0,1]               0.00920      20
(1/sqrt(2*p...      [0,sqrt(2)]         0.01670      45
sin(exp(x))         [0,2]               0.07810      265

Table 1. Speed-up in computation time for 15 functions (expressed as a percentage
of the time needed when the domain is divided into 1,000,000 points)
for ε = 2^-24.


The NFG circuit consists of three multipliers, one 3-input adder, a segment
indexing method and the memory that contains the approximation polynomials'
coefficients for each segment.

Figure 1 is a block diagram that shows an overview of the NFG circuit.

Figure 1. Numeric function generator (NFG) architecture.
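
As a rough software model of this datapath, the sketch below evaluates one input using a segment lookup, three multiplications and one three-input addition. The assignment of the three multipliers (x·x, c2·x², c1·x) is an assumption, and the function name and memory contents are hypothetical illustrations rather than the thesis design.

```python
def nfg_eval(x, segment_starts, coefficients):
    """Software model of an NFG-style datapath for one input x.

    segment_starts: sorted start point of each segment.
    coefficients:   one (c2, c1, c0) tuple per segment (the coefficient memory).
    """
    # Segment indexing: pick the last segment whose start does not exceed x.
    index = max(i for i, start in enumerate(segment_starts) if start <= x)
    c2, c1, c0 = coefficients[index]
    x_squared = x * x                  # multiplier 1 (assumed role)
    quadratic_term = c2 * x_squared    # multiplier 2 (assumed role)
    linear_term = c1 * x               # multiplier 3 (assumed role)
    return quadratic_term + linear_term + c0   # 3-input adder


# Hypothetical two-segment memory contents, for illustration only.
starts = [0.0, 0.5]
memory = [(1.0, 0.0, 0.0), (0.0, 1.0, -0.25)]
y_low = nfg_eval(0.25, starts, memory)
y_high = nfg_eval(0.75, starts, memory)
```

Changing the realized function only requires new `starts` and `memory` contents, which mirrors the idea of one hardware design serving many functions.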


Two approximation methods are implemented. In one method, non-uniform 
segments are used. Here, larger segments can be used where the function is close to 
quadratic, while highly non-quadratic regions require smaller segments. This approach 
minimizes the number of segments. In the other method, uniform segments are used. 
Although more segments are needed than in the non-uniform method, the circuit is
simpler.


We show that accuracies of up to 33 bits are possible. A pipelined circuit was
built on the SRC-6 in two's complement and floating point. The floating point
implementation is easier to program via the interface that SRC provides. A
<subroutine>.mc file is a C-like file that is compiled into the hardware that resides on the
FPGAs in the SRC Multi-Adaptive Programming (MAP) board.


Using a fixed point implementation produces a shorter pipeline depth
(approximately 30% of the floating point pipeline depth), but requires more effort by the
programmer to ensure the bits are aligned correctly. In the fixed point implementation, the
bits are truncated instead of rounded. This introduces errors in the intermediate
computations that propagate to the final answer.


The best solution to this problem is to build a user macro multiplier that takes care
of the rounding and ensures the bits are aligned in the intermediate results of the
polynomial computation.
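
The effect of truncating rather than rounding intermediate products can be illustrated in software. The sketch below is only an illustration under simple assumptions: values are quantized to a fixed number of fractional bits after each multiplication, and the helper names and coefficient values are hypothetical.

```python
import math


def quantize(value, frac_bits, mode):
    """Keep frac_bits fractional bits, by truncation or by rounding."""
    scaled = value * (1 << frac_bits)
    if mode == "truncate":
        q = math.floor(scaled)          # two's complement truncation drops low bits
    else:
        q = math.floor(scaled + 0.5)    # round to nearest
    return q / (1 << frac_bits)


def quadratic_fixed(x, c2, c1, c0, frac_bits, mode):
    """Evaluate c2*x^2 + c1*x + c0, quantizing after every multiplication."""
    x_sq = quantize(x * x, frac_bits, mode)
    t2 = quantize(c2 * x_sq, frac_bits, mode)
    t1 = quantize(c1 * x, frac_bits, mode)
    return t2 + t1 + c0


x, c2, c1, c0 = 0.3, 0.7, 0.4, 0.1    # illustrative values only
exact = c2 * x * x + c1 * x + c0
err_trunc = abs(quadratic_fixed(x, c2, c1, c0, 8, "truncate") - exact)
err_round = abs(quadratic_fixed(x, c2, c1, c0, 8, "round") - exact)
```

Truncation always errs in the same direction, so the per-multiplication errors tend to accumulate, whereas rounding halves the worst-case error of each step.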
I. INTRODUCTION


A. PROBLEM STATEMENT AND PURPOSE 

High-speed numeric computation has many applications, including digital signal
processing, graphics rendering and meteorological modeling. These applications require
numeric calculations to be computed quickly. In addition, the hardware may be required
to process large amounts of data or streaming data, which means long periods of time
may be spent performing one type of computation. Personal computers are
general purpose and not specifically designed for numeric calculations alone; instead they
provide the best compromise between speed and flexibility.

The CORDIC algorithm can be very precise, but it has the disadvantage of being 
iterative and slow; the operations can take hundreds to thousands of clock cycles. Each 
iteration in the CORDIC algorithm provides increased accuracy at the output [4]. 

It would be beneficial to have specialized and fast hardware for high speed 
numeric calculations. Conventional methods for computing numeric functions include 
the CORDIC algorithm [2], [3], [4]. The problem is that specialized hardware is 
inflexible to computing different numeric functions as well as to changes in requirements 
or software updates. However, specialized hardware is fast. 

A very fast method for numeric calculations is a look-up table [5], i.e., for every
possible input, the desired output of the numeric function is stored. The disadvantage of
this approach is that a large amount of memory is needed.


Field programmable devices have the advantage that one can quickly design, test
and replace hardware functionality. Compare this with traditional methods, whereby a
design is simulated in software, prototyped on a prototyping board, and
then sent to a manufacturer. That process is expensive and time consuming, especially if
changes are required.


FPGA technology has improved to the point that a large amount of logic is
available. If we have a few divergent needs that require particularly heavy
computation best solved by specialized hardware, we can use FPGA
devices to implement a specialized hardware design. Once the task has been completed,
the hardware can be reconfigured for other uses. The NFG we will discuss uses this
principle on the SRC-6 computer system.


Lee, Wayne, Villasenor and Cheung [6] used a cascade of AND and OR gates to
calculate segment addresses in a non-uniform segmentation implementation for hardware
function evaluation. This circuit is useful for a limited class of functions. Sasao, Butler
and Riedel [5] present a universal circuit that can cater to a wider class of functions.


Sasao, Butler and Riedel [5] have shown that elementary and non-elementary
numeric functions can be computed quickly and accurately using a piecewise linear
approximation method. This method provides some advantages over the memory method
and the CORDIC algorithm. Less memory is required than for a look-up table because the
numeric function is segmented and only the coefficients of the piecewise linear approximation
are stored, rather than every possible input value and its corresponding output. Another
advantage is that the accuracy is fixed at the outset, so the method is faster than
the CORDIC algorithm, especially at higher accuracy, where CORDIC must go
through several iterations to attain the desired accuracy. One more advantage of this
approach is that it allows for one hardware design, with the memory contents being
changed to handle different numeric functions [1].


This thesis investigates a piecewise quadratic implementation. The quadratic
implementation requires fewer segments than the linear implementation to compute the
same numeric functions to the same accuracy. This also means that the memory required
is less than that required to implement a piecewise linear approximation NFG.


B. IMPLEMENTATION OVERVIEW 


Figure 1 shows the hardware required to build the NFG using quadratic
approximation. The NFG architecture requires three multipliers. Each requires
significant logic and causes significant delay.

Figure 1. Numeric function generator (NFG) architecture.


Table 1 shows the suite of functions used to test and design the NFG. Unlike
logic or software design, there is no set of benchmarks. The specific functions have been
chosen because they have appeared in previous papers on this subject [1], [5], [8],
[9], [11], [12], [15].

No.  Function f(x)                       Domain             Range
 1   2^x                                 [0,1]              [1,2]
 2   1/x                                 [1,2]              [1/2,1]
 3   √x                                  [1,2]              [1,√2]
 4   1/√x                                [1,2]              [1/√2,1]
 5   log2(x)                             [1,2]              [0,1]
 6   ln(x)                               [1,2]              [0,ln 2]
 7   sin(πx)                             [0,1/2]            [0,1]
 8   cos(πx)                             [0,1/2]            [0,1]
 9   tan(πx)                             [0,1/4]            [0,1]
10   √(-ln(x))                           [1/512,1/4]        [√(-ln(1/4)), √(-ln(1/512))]
11   tan²(πx) + 1                        [0,1/4]            [1,2]
12   -(x log2(x) + (1-x) log2(1-x))      [1/256,1-1/256]    [0,1]
13   1/(1+e^-x)                          [0,1]              [1/2, 1/(1+e^-1)]
14   (1/√(2π)) e^(-x²/2)                 [0,√2]             [1/(√(2π) e), 1/√(2π)]
15   sin(e^x)                            [0,2]              [-1,1]

Table 1. Suite of numeric functions and their domains.
C. THESIS ORGANIZATION

This thesis is organized into six chapters. Chapter I is the introduction. Chapter II
covers the segmentation of numeric functions and the methods used for computing the
approximation of the functions; this includes the discussion on how the coefficients were
computed and how the memory files were used in the NFG. These programs were
designed in MATLAB [7]. In Chapter III, the circuit design is covered.
Chapter IV introduces the SRC computer architecture. The experimental results are
discussed in Chapter V. The summary and suggested future work are discussed in Chapter
VI.
II. FUNCTION APPROXIMATION


The NFG approximates the realized function by polynomials. In a typical
realization, many polynomials are used. A segment is a sub-domain in the interval of
approximation where one polynomial is used to approximate the function. In this thesis,
quadratic polynomials are used. The benefit of using a polynomial approximation is that
only one hardware design is required to realize a multitude of functions. The only change
required to the hardware is to change the specific endpoints of the segmentation of the
functions to be realized and the associated coefficients. The segmentation endpoints and
coefficients are generated in MATLAB and are stored in a memory file. Segmentation is
described in detail below.


The realized functions are approximated and the output of the hardware is only as
accurate as the user-defined precision. The approximation error is ε. The exact function
is evaluated for various values in the domain. The polynomial that is used to
approximate the function is evaluated for the same values in the domain. The difference
between these two results is the approximation error ε. The approximation error ε is the
constraint used to keep the approximation in check.


The approximation error ε directly impacts how many segments are required and
therefore dictates how much memory is used to store the coefficients. Small values of ε
require many segments.


A. QUADRATIC VS LINEAR 

Nagayama, Sasao and Butler [8] showed that using quadratic approximations in
the NFG requires an average of only 4% of the memory required when using linear
approximations. This gives the motivation to pursue quadratic approximation following
the work on linear approximation that was performed by Mack [1].


In Table 2, the number of segments required for different accuracies is tabulated
for both quadratic approximation and linear approximation. A column is also included
that shows the ratio of quadratic to linear segments required, as a percentage. The three
column groups correspond to increasing accuracy from left to right.

Function                          Quad/Lin   Ratio %   Quad/Lin    Ratio %   Quad/Lin       Ratio %
2^x                               7/75        9.33     35/849      4.12      278/19008      1.46
1/x                               10/75      13.33     50/849      5.89      400/18996      2.11
√x                                5/35       14.29     24/388      6.19      189/8729       2.17
1/√x                              8/50       16.00     36/565      6.37      288/12684      2.27
log2(x)                           9/76       11.84     44/853      5.16      351/19097      1.84
ln(x)                             8/63       12.70     39/710      5.49      311/15927      1.95
sin(πx)                           12/109     11.01     58/1227     4.73      461/27361      1.68
cos(πx)                           12/109     11.01     58/1227     4.73      459/27361      1.68
tan(πx)                           12/73      16.44     58/822      7.06      459/18371      2.50
√(-ln(x))                         33/207     15.94     163/2356    6.92      1312/47188     2.78
tan²(πx) + 1                      16/152     10.53     79/1721     4.59      631/38087      1.65
-(x log2(x) + (1-x) log2(1-x))    37/314     11.78     183/3556    5.15      1459/76334     1.91
1/(1+e^-x)                        4/20       20.00     20/226      8.85      158/5087       3.11
(1/√(2π)) e^(-x²/2)               9/53       16.98     45/595      7.56      357/13312      2.68
sin(e^x)                          54/449     11.80     265/5099    5.20      2121/101065    2.10

Table 2. Segmentation required for linear and quadratic approximations.


To calculate the memory required for a single segment, one needs to take into
account that a linear approximation requires only two stored quantities (slope and
intercept), while a quadratic approximation requires three. That is a
50% increase in memory requirements for a single segment when compared to linear.
However, the sheer difference in the number of segments required for quadratic versus
linear approximation more than compensates for the increase in per-segment memory.
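
The trade-off can be checked with a small calculation. The sketch below assumes three stored words per quadratic segment, two per linear segment, and equal word widths (a simplification; the function name is hypothetical). It is not expected to reproduce the 4% average quoted from [8], which may rest on different accuracy settings and word widths.

```python
def memory_ratio_percent(quad_segments, lin_segments):
    """Quadratic coefficient memory as a percentage of linear coefficient memory.

    Assumes three stored words per quadratic segment, two per linear segment,
    and equal word widths.
    """
    return 100.0 * (3 * quad_segments) / (2 * lin_segments)


# Segment counts for cos(pi*x) taken from Table 2: 58 quadratic vs 1227 linear.
ratio = memory_ratio_percent(58, 1227)
```

Even with the 50% per-segment penalty, the ratio stays well below 100% whenever the quadratic segment count is less than two thirds of the linear one.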


Table 2 shows that quadratic approximations can cover the same functions with far fewer
segments than linear approximations. On average, quadratic approximations take up
only 4% of the memory required to represent the same function when using linear
approximations [8].


B. SEGMENTATION 

To evaluate a numeric function using polynomial approximation, we need to
segment the domain of the numeric function such that each segment has one set of
coefficients that evaluate to the polynomial approximation of the given numeric function.
The polynomial approximation needs to satisfy the user-defined ε such that any value in
the domain that is evaluated using the polynomial will produce an output f(x) that has an
error no greater than ε in magnitude. The segmentation is performed in MATLAB
routines.


Segmentation can be performed using either uniform or non-uniform segments.
The coefficients of the approximation polynomial can be computed using Polyfit [7],
which is a built-in MATLAB function, or using the Chebyshev and Remez [13] algorithms,
which are implemented as a user function. We will discuss these approaches in more detail.


1. Uniform and Non-Uniform Segmentation

There are two general methods used in approximating a function: uniform and
non-uniform segmentation. Different functions behave differently when segmented using
uniform or non-uniform segmentation. Non-uniform segmentation allows the user to take
advantage of functions that have both rapidly changing and slowly changing
sections. When functions have sections of high curvature, non-uniform segmentation can
create smaller segments to ensure the polynomial approximation does not exceed ε. The
closer the function is to quadratic or linear, the better a quadratic polynomial can fit it.
As a result, segments are longer in regions where the function
is linear or quadratic. The goal is to achieve the fewest segments possible and yet
achieve the approximation error specified by the user. Figure 2 shows the non-uniform
segmentation of √(-ln(x)) using ε = 2^-16 (accurate to 16 binary bits). This function
illustrates the advantage of non-uniform segmentation. The smaller segments are located
at the beginning of the domain and the larger segments are at the end.


[Plot: NON-UNIFORM f(x)=sqrt(-log(x)) segmentation.]

Figure 2. Quadratic segmentation of √(-ln(x)) shows the difference in the size of
segments due to curvature of the function.


As mentioned above, the error associated with this segmentation should not
exceed ε = 2^-16. Figure 3 shows the error across the interval of approximation when
non-uniform segmentation is used. For all but the last segment, the maximum absolute error
is the same (about 2^-16.011). As shown in Figure 3, the error does not exceed ε anywhere.
Note that the error in the last (rightmost) segment is much less than in all other segments.
This is because the last segment is truncated by the boundary of the domain interval
before the algorithm has a chance to maximize the size of the segment.
[Plot: Error for NON-UNIFORM f(x)=sqrt(-log(x)) segmentation. No. of segs = 26.
Approximation error, max value = 1.5142e-005 = 2^-16.0111.]

Figure 3. Segment error of √(-ln(x)) when ε = 2^-16.


Figure 4 shows the approximation error for this same function when
uniform segmentation is applied¹. To achieve uniform segmentation within the same
approximation error specification, i.e., 2^-16, we are required to use the width of the
narrowest segment, which in this case is the very first segment.


1 Because a large number of segments are required, the line width occupies the whole of the figure, 
making it appear completely solid. 
[Plot: UNIFORM f(x)=sqrt(-log(x)) segmentation. No. of segments = 714.]

Figure 4. Quadratic uniform segmentation for √(-ln(x)) when limited by ε = 2^-16.


The error function for a uniform segmentation looks different from that of the
non-uniform segmentation. The error for uniform segmentation is maximum, i.e., ε is
attained, in the most limiting segment. However, in the other segments the
error does not reach ε. Therefore a tapered effect is observed. To best demonstrate this
effect, we shall use a less “dramatic” function than √(-ln(x)). Instead, cos(πx) is used in
Figure 5 and Figure 6 to show the difference in the error between the uniform and
non-uniform segmentation.


In Figure 5 below, the error is tapered, showing that the earlier segments do not
take full advantage of the allowed error because their width has been limited by the narrowest
segment, located at the end of the domain for the cos(πx) function.

In Figure 6, however, non-uniform segmentation has taken full
advantage of the allowed error in every segment and needs fewer segments to represent the
same function. This is the advantage of non-uniform segmentation over uniform segmentation.
[Plot: Error for UNIFORM f(x)=cos(pi*x) segmentation. No. of segs = 14.
Max Error = 7.3441e-006 = 2^-17.055.]

Figure 5. Uniform segmentation error for cos(πx) when limited by ε = 2^-17.

[Plot: Error for NON-UNIFORM f(x)=cos(pi*x) segmentation. No. of segs = 12.
Max Error = 7.6294e-006 = 2^-17.]

Figure 6. Error for non-uniform segmentation for cos(πx) when limited by ε = 2^-17.
In the segmentation of a numeric function, a user interface was designed in a
MATLAB program to get the user's choices. The user interface allows the user to select
which function to segment, the number of
points (to subdivide the domain), ε, and whether uniform or non-uniform segmentation is
used.

If the user selects non-uniform segmentation, the interface looks like that shown
in Figure 7.


**********************************************************************
QUADRATIC APPROXIMATION OF A FUNCTION USING CHEBYCHEV
AND REMEZ ALGORITHM
**********************************************************************
Functions to be compared                      Interval
 1. 2^x                                       [0,1]
 2. 1/x                                       [1,2]
 3. sqrt(x)                                   [1,2]
 4. 1/sqrt(x)                                 [1,2]
 5. log2(x)                                   [1,2]
 6. log(x) = ln(x)                            [1,2]
 7. sin(pi*x)                                 [0,1/2]
 8. cos(pi*x)                                 [0,1/2]
 9. tan(pi*x)                                 [0,1/4]
10. sqrt(-log(x)) = sqrt(-ln(x))              [1/256,1/4]
11. tan(pi*x)^2 + 1                           [0,1/4]
12. -(x*log2(x) + (1-x)*log2(1-x))            [1/256,1-1/256]
13. 1/(1+exp(-x)) = 1/(1+e^(-x))              [0,1]
14. (1/sqrt(2*pi))*exp(-x^2/2)                [0,sqrt(2)]
15. sin(exp(x))                               [0,2]
**********************************************************************
Input the Function, func[sqrt(-1*log(x))]:
(1) Non-uniform or (2) Uniform Segmentation or (3)Both [1]:
Input the Desired Error, epsilon[2^-33]: 2^-16
Input the no. of pts the fct is to be evaluated, N[1000000]:
**********************************************************************

Figure 7. Quadratic approximation user-interface when non-uniform segmentation has
been used.
If the user selects uniform segmentation, the user interface allows the user to
choose between specifying ε and specifying a fixed number of
segments. The new user interface looks like that shown in Figure 8.


**********************************************************************
QUADRATIC APPROXIMATION OF A FUNCTION USING CHEBYCHEV
AND REMEZ ALGORITHM
**********************************************************************
Functions to be compared                      Interval
 1. 2^x                                       [0,1]
 2. 1/x                                       [1,2]
 3. sqrt(x)                                   [1,2]
 4. 1/sqrt(x)                                 [1,2]
 5. log2(x)                                   [1,2]
 6. log(x) = ln(x)                            [1,2]
 7. sin(pi*x)                                 [0,1/2]
 8. cos(pi*x)                                 [0,1/2]
 9. tan(pi*x)                                 [0,1/4]
10. sqrt(-log(x)) = sqrt(-ln(x))              [1/256,1/4]
11. tan(pi*x)^2 + 1                           [0,1/4]
12. -(x*log2(x) + (1-x)*log2(1-x))            [1/256,1-1/256]
13. 1/(1+exp(-x)) = 1/(1+e^(-x))              [0,1]
14. (1/sqrt(2*pi))*exp(-x^2/2)                [0,sqrt(2)]
15. sin(exp(x))                               [0,2]
**********************************************************************
Input the Function, func[sqrt(-1*log(x))]:
(1) Non-uniform or (2) Uniform Segmentation or (3)Both [1]: 2
Would you like to constrain (1)Number of Segments or (2)Error [1]:
Input the number of Desired Segments[20]:
Input the no. of pts the fct is to be evaluated, N[1000000]:

Figure 8. Quadratic approximation user-interface when uniform segmentation has been
specified.


a. Summary of Advantages and Disadvantages of Uniform and Non-Uniform Segmentation

Table 3 shows a summary of the advantages and disadvantages of
uniform and non-uniform segmentation.
                 Advantages                            Disadvantages

Uniform          • No need for segment index           • High curvature functions
Segmentation       encoder                               require many segments
                 • Less complex hardware                 (wastes memory)

Non-Uniform      • Covers high curvature functions     • Requires segment index
Segmentation       with segments that are as wide        encoder
                   as possible (saves memory)          • More complex design

Table 3. Summary of Advantages and Disadvantages of Uniform and Non-uniform
Segmentation.
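
The difference in indexing effort can be sketched in software. The example assumes, for illustration only, that uniform segmentation uses a power-of-two number of segments over an unsigned fixed-point input, so the index is just the leading bits, while non-uniform segmentation must search a list of segment start points. Both function names are hypothetical.

```python
from bisect import bisect_right


def uniform_index(x_fixed, total_bits, index_bits):
    """Uniform segments: the top index_bits of the input are the segment number."""
    return x_fixed >> (total_bits - index_bits)


def nonuniform_index(x, segment_starts):
    """Non-uniform segments: search the sorted start points (the encoder's job)."""
    return bisect_right(segment_starts, x) - 1


u = uniform_index(0b10110000, total_bits=8, index_bits=3)
v = nonuniform_index(0.30, [0.0, 0.05, 0.15, 0.40])
```

In hardware, the first case amounts to wiring the upper input bits to the memory address, whereas the second needs dedicated comparison logic.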


2. Segment Coefficients Using Polyfit and the Remez Algorithm 

To obtain the coefficients of a segment when segmenting any function, several
different algorithms may be used. In [5], Sasao et al. use the Douglas-Peucker algorithm
[10] for segmenting and providing linear approximations to the functions. However, this
algorithm does not yield an optimum segmentation [11].


The initial work in this thesis used the Polyfit [7] function, available in
MATLAB, to find the coefficients. Polyfit is computationally efficient and has been
optimized for MATLAB. It requires a set of data points that represent the function, to
which it fits a polynomial of order n. In this thesis, we are working with
quadratic functions and therefore use n = 2. Polyfit finds the coefficients of the
approximating polynomial in a least squares sense [7] and returns a row vector with the
coefficients of the polynomial. Least squares approximations minimize the average error
on the interval selected. However, the worst-case error can be large. That is, the fit yields an
average error that satisfies the given constraint, i.e., ε, but the worst-case errors may still
exceed the constraint.
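
The gap between average and worst-case error can be seen in a small experiment. The sketch below uses NumPy's `polyfit` as an analogue of MATLAB's Polyfit; the segment [0, 0.1] of cos(πx) and the sample count are arbitrary illustrative choices, not values from the thesis.

```python
import numpy as np


def least_squares_quadratic_error(f, a, b, samples=1001):
    """Fit a quadratic to f on [a, b] by least squares and return the error curve."""
    x = np.linspace(a, b, samples)
    coeffs = np.polyfit(x, f(x), 2)        # NumPy analogue of MATLAB Polyfit, n = 2
    return x, f(x) - np.polyval(coeffs, x)


x, err = least_squares_quadratic_error(lambda t: np.cos(np.pi * t), 0.0, 0.1)
rms_error = float(np.sqrt(np.mean(err ** 2)))
worst_error = float(np.max(np.abs(err)))
worst_at = int(np.argmax(np.abs(err)))     # index of the worst-case sample
```

On this segment the worst-case error exceeds the RMS error and occurs at one of the segment ends, so a segment sized by average error would violate an ε constraint on the maximum error.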


In analyzing the approximation polynomials produced from the coefficients
provided by the Polyfit function, the graphs showing the error over each segment had the
largest error at the begin and end points of the segment, as can be seen in Figure 9 below.
This graph shows the weakness in using least squares approximation methods like
that used by Polyfit. Our goal is to minimize the number of segments for the given function
while restraining the maximum error to no greater than ε. Therefore, Polyfit was
abandoned and instead the Remez algorithm [13] was used.


The Remez algorithm uses a method of approximation that minimizes the worst-case
error. It belongs to the set of least maximum approximations (minimax
approximations). The program ensures that there is no point in the interval where the
error, found by evaluating the difference between the approximation polynomial and the
real function, is greater than the given constraint.


[Plot: Error for NON-UNIFORM f(x) segmentation. No. of segs = 14.
Max Error = 2^-17.00348. x axis from 0 to 0.18.]

Figure 9. Quadratic non-uniform segmentation approximation error using Polyfit.


The advantage of the Remez algorithm is that it evenly distributes the error over the
segment so that the maximum error is constrained by ε. This can be clearly observed by
comparing Figure 9 and Figure 10. The function cos(πx) with ε = 2^-17 was used in
both cases. Notice that Polyfit needed 14 segments while Remez required only 12 segments.
Both figures display only the first four segments, and the difference is readily noticeable:
Remez covers a larger portion of the domain in the four segments than Polyfit does. As a
result, it tends to reduce the number of segments. In the Remez implementation, the fourth
segment extends right past 0.21 in the x domain, while Polyfit barely makes it to 0.19.


The Remez algorithm attempts to achieve the minimax degree-n polynomial
approximation of the given function on a defined interval. In the program that was used
for this thesis, the interval is iteratively revised and the Remez algorithm is repeatedly
called until a degree-2 polynomial approximation that satisfies the constraint is achieved.
The process is constrained by ε, and the interval is increased or decreased until the
optimum segment endpoint lies between the current point and the next point on the
domain interval.
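
The idea of growing or shrinking the interval until the end point is pinned between two neighbouring grid points can be sketched as a bisection on the grid index. This is an illustration only: a least-squares fit stands in for the Remez fit to keep the example short, the error is assumed to grow as the segment widens, a one-step segment is assumed to meet ε, and all names are hypothetical.

```python
import numpy as np


def fit_error(f, x0, x1, samples=200):
    """Worst-case error of a least-squares quadratic on [x0, x1] (a stand-in fit)."""
    t = np.linspace(x0, x1, samples)
    u = (t - x0) / (x1 - x0)               # normalise to [0, 1] for conditioning
    coeffs = np.polyfit(u, f(t), 2)
    return float(np.max(np.abs(f(t) - np.polyval(coeffs, u))))


def segment_domain(f, a, b, eps, n_points=1001):
    """Greedy left-to-right segmentation; returns the list of segment end points."""
    grid = np.linspace(a, b, n_points)
    ends, start = [], 0
    while start < n_points - 1:
        lo, hi = start + 1, n_points - 1   # lo: assumed feasible, hi: candidate
        if fit_error(f, grid[start], grid[hi]) <= eps:
            lo = hi                        # the rest of the domain fits in one segment
        else:
            while hi - lo > 1:             # bisect on the grid index
                mid = (lo + hi) // 2
                if fit_error(f, grid[start], grid[mid]) <= eps:
                    lo = mid
                else:
                    hi = mid
        ends.append(float(grid[lo]))
        start = lo
    return ends


ends = segment_domain(lambda t: np.cos(np.pi * t), 0.0, 0.5, 2.0 ** -17)
```

Each segment costs on the order of log2(N) fits instead of one fit per grid point, which is the kind of saving a binary-style search offers over single stepping.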


[Plot: Error for NON-UNIFORM f(x)=cos(pi*x) segmentation. No. of segs = 12.
x axis from 0 to 0.2.]

Figure 10. Quadratic non-uniform segmentation approximation error using Remez. (Only
the first four segments are shown).
The Remez algorithm requires much more computational time and effort than the
Polyfit function (which is already optimized for MATLAB). In general, for a function f on an
interval [a, b], there are many degree-n polynomial approximations, but only one polynomial
p* is the minimax degree-n approximation. The error of this approximation attains its
maximum magnitude, with alternating sign, at no fewer than n+2 points
x_0, ..., x_{n+1} that satisfy inequality (0.1).

$$a \le x_0 < x_1 < \cdots < x_{n+1} \le b \tag{0.1}$$


The begin point and end point of the interval are included. In the case of 
quadratic approximations, a degree-2 polynomial can expect at least 4 points where the 
error will be maximum and will alternate in sign, as seen in Figure 10. The Remez 
algorithm is iterative and requires an initial estimate of the points where the error is 
maximum. The Chebyshev approximation is better than most other approximation 
algorithms at obtaining a polynomial close to the minimax polynomial p*. When compared 
to Taylor series and Legendre approximations, Chebyshev provides a better estimate in 
most cases. For this reason, Chebyshev approximation is used to provide a set of starting 
points for the Remez algorithm in this thesis. The previous discussion is described in 
more detail in [13]. 
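The role of the Chebyshev starting points can be illustrated with a minimal Python sketch. It is not the ChebyRemez code of Appendix B: the helper names, the test interval [0.10, 0.11] and the choice of f(x) = cos(πx) are assumptions made for illustration. A quadratic that interpolates f at three Chebyshev nodes is already close to minimax, and its error at the n+2 = 4 candidate points (both interval endpoints included) alternates in sign with nearly equal magnitude, which is what makes those points a reasonable initial guess for Remez.

```python
import math

def chebyshev_nodes(a, b, n):
    """n Chebyshev nodes mapped from [-1, 1] onto [a, b]."""
    mid, half = (a + b) / 2.0, (b - a) / 2.0
    return [mid + half * math.cos((2 * k + 1) * math.pi / (2 * n)) for k in range(n)]

def interpolate(xs, ys):
    """Return the Lagrange interpolating polynomial through (xs, ys) as a function."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

def starting_points(a, b):
    """Four candidate extreme points for a degree-2 fit, in ascending order.
    They are the extrema of the degree-3 Chebyshev polynomial mapped to [a, b]."""
    mid, half = (a + b) / 2.0, (b - a) / 2.0
    return [mid + half * math.cos(k * math.pi / 3) for k in range(3, -1, -1)]

f = lambda x: math.cos(math.pi * x)   # illustrative function
a, b = 0.10, 0.11                     # illustrative interval (assumption)
nodes = chebyshev_nodes(a, b, 3)
p = interpolate(nodes, [f(x) for x in nodes])
errors = [f(x) - p(x) for x in starting_points(a, b)]
print(errors)   # signs alternate; magnitudes are nearly equal
```

A full Remez iteration would then adjust these four points until the error magnitudes are exactly equal.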


The function ChebyRemez in Appendix B was written to implement the Remez 
algorithm with an initial set of points where the error is maximum. Using Remez slowed 
down the program written to compute the coefficients, especially when higher accuracy 
was desired or, in general, when the x domain interval was assigned more points, N. To 
counter this effect, different algorithms were investigated to speed up the program. 
These are discussed further in Section 3 below. 


3. Algorithms Investigated to Speed-Up the Segmentation 
In the program proposed by Sasao, Butler and Riedel [5], the domain was divided 
into points and the segmentation was determined by brute force, i.e. stepping point by 
point to determine the required size of each segment. To attain high accuracy, the 
domain needs to be subdivided into hundreds of thousands or even millions of points. 
This results in slow execution. We investigate ways to speed up the segmentation. 


a. Brute Force 

The lower value of the domain is established as the begin point. The 
program steps through each point, computing the minimax degree-2 polynomial 
approximation of the function. When evaluating any segment (even two consecutive 
points), the program creates 1000 points between the given begin point and the end point. 
This ensures enough points for the program to locate the points in the segment where the 
maximum and minimum error is achieved, as described above. The required coefficients 
are then computed, and the approximating polynomial is used to evaluate all the 
points in the current segment. These values are compared with the actual values of the 
function, and the maximum error is determined. If the error is smaller than ε, the 
program steps one point to the right and repeats the process. Eventually, the 
polynomial approximation will produce a maximum error that exceeds ε. At this point 
the program steps back one step and records the end point of the segment. For a typical 
segmentation with N = 1,000,000, this program is slow, because it calls the approximation 
routine for every one of the N points. N is defined as the number of points on the entire 
interval of the domain, i.e. the number of points on the interval [a, b]. 
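The brute-force procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis program: `quad_max_error` is a hypothetical stand-in for chebyRemz that interpolates at three Chebyshev nodes instead of running Remez, it samples 200 points per segment rather than 1000, and a one-step segment is assumed to meet ε.

```python
import math

def quad_max_error(f, a, b, samples=200):
    """Stand-in for chebyRemz (assumption): max |f - p| on [a, b], where p is the
    quadratic interpolating f at three Chebyshev nodes (near-minimax, not Remez)."""
    mid, half = (a + b) / 2.0, (b - a) / 2.0
    xs = [mid + half * math.cos((2 * k + 1) * math.pi / 6) for k in range(3)]
    ys = [f(x) for x in xs]

    def p(x):
        total = 0.0
        for i in range(3):
            term = ys[i]
            for j in range(3):
                if j != i:
                    term *= (x - xs[j]) / (xs[i] - xs[j])
            total += term
        return total

    grid = [a + (b - a) * k / samples for k in range(samples + 1)]
    return max(abs(f(x) - p(x)) for x in grid)

def brute_force_segments(f, pts, eps):
    """Return (segment end indices, number of approximation calls)."""
    ends, calls, begin, last = [], 0, 0, len(pts) - 1
    while begin < last:
        end = begin + 1                  # a one-step segment is assumed to meet eps
        while end < last:
            calls += 1
            if quad_max_error(f, pts[begin], pts[end + 1]) > eps:
                break                    # one step too far: keep the previous end
            end += 1
        ends.append(end)
        begin = end
    return ends, calls

f = lambda x: math.cos(math.pi * x)
pts = [0.5 * i / 500 for i in range(501)]      # N = 501 points on [0, 0.5]
ends, calls = brute_force_segments(f, pts, 2.0 ** -17)
print(len(ends), "segments,", calls, "calls")
```

The call count grows in proportion to N, which is the behavior the faster methods below try to avoid.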


b. Binary Search 


Binary search is really a two-step process: 
1. Locate: A point close to the optimum point is determined. 
2. Pinpoint: Use brute force to move up to the optimum point. 

In step 1, given a function f and an interval [a, b], starting on the left at a, 
the lower value of the domain is established as the begin point and the end point is set to 
b. This is the entire domain interval over which the program computes the minimax 
degree-2 polynomial approximation. Given the constraint ε, the program tests the error 
of the approximation and, if the error is greater than the constraint, divides the 
interval into two equal parts and decreases the proposed interval. Figure 11 shows a 
graphic representation of the first 4 iterations. These iterations are part of step 1, 
Locate. The optimum endpoint of the first segment is labeled x0. In the first iteration 
shown in Figure 11, the interval [a, b] is tested to determine if it is a good segment 
size. Since it is too large, the interval is divided in two. The new interval is 
[a, 1st proposed x0]. The process is repeated and the approximation of this new proposed 
segment is tested against the constraint. This is an iterative process that decreases the 
width of the segment. The next proposed segment is [a, 2nd proposed x0], as shown in 
Figure 11. Again the segment is tested. If the constraint is not met, the segment width 
is halved again. 


[Plot: NON-UNIFORM f(x) = 2^x segmentation, number of segments = 7. The 1st, 2nd and 3rd 
proposed x0 and the optimum endpoint x0 are marked on the x axis between a and b.] 


Figure 11. Shows the interval and segmentation notation. 
The process is repeated until the constraint is met. In Figure 11, the 
constraint is met on the fourth try and results in a proposed segment [a, 3rd proposed x0]. 

Once below the optimum end point, the program increases the proposed segment 
endpoint until the constraint is exceeded. The segment is increased by half of 
the last width used to decrease the proposed segment. In Figure 11, the last width was 
(2nd proposed x0) - (3rd proposed x0). The process of increasing and decreasing the 
segment size by widths that are halved at each iteration is repeated until the width 
being used to increment or decrement is 1. At this point, step 1 (Locate) is done and we 
move to step 2, which uses brute force to Pinpoint the optimum segment endpoint. 


The binary search finds the actual segment end point in approximately s 
steps as described by inequality (0.2) where npts is the number of points in the initial 


proposed segment. 


$$s \ge 1 + \log_2(\mathrm{npts}) \qquad (0.2)$$


Compared to the number of steps required by brute force, this is a 
dramatic improvement. Consider N = 1,000,000. The binary search for the first 
segment should take around 21 steps to find the optimum segment end point x0, since 
npts in this case is 1,000,000. The number of steps required to reach the segment end 
point is reduced as the program progresses to the end of the domain interval, because 
the argument npts in inequality (0.2) decreases. In Table 4 the binary search takes 924 
calls to the function chebyRemz, as opposed to the brute force method, which makes 
1,000,000 calls. 
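As a rough illustration of the step count, 1 + log2(1,000,000) is about 20.9, which matches the figure of around 21 steps quoted above. The Python sketch below uses a textbook bisection on the point index in place of the thesis's Locate and Pinpoint pair; that substitution, the helper `quad_max_error` (a hypothetical stand-in for chebyRemz based on Chebyshev-node interpolation) and the assumption that the error grows with segment width are all simplifications made here.

```python
import math

def quad_max_error(f, a, b, samples=200):
    """Stand-in for chebyRemz (assumption): max |f - p| on [a, b], where p is the
    quadratic interpolating f at three Chebyshev nodes (near-minimax, not Remez)."""
    mid, half = (a + b) / 2.0, (b - a) / 2.0
    xs = [mid + half * math.cos((2 * k + 1) * math.pi / 6) for k in range(3)]
    ys = [f(x) for x in xs]

    def p(x):
        total = 0.0
        for i in range(3):
            term = ys[i]
            for j in range(3):
                if j != i:
                    term *= (x - xs[j]) / (xs[i] - xs[j])
            total += term
        return total

    grid = [a + (b - a) * k / samples for k in range(samples + 1)]
    return max(abs(f(x) - p(x)) for x in grid)

def binary_search_segment_end(f, pts, begin, eps):
    """Return (end index, calls) for the widest segment from pts[begin] meeting eps."""
    calls = 0

    def meets(end):
        nonlocal calls
        calls += 1
        return quad_max_error(f, pts[begin], pts[end]) <= eps

    lo, hi = begin + 1, len(pts) - 1     # lo is assumed to meet eps
    if meets(hi):                        # the whole remainder is one segment
        return hi, calls
    while hi - lo > 1:                   # invariant: lo meets eps, hi does not
        mid = (lo + hi) // 2
        if meets(mid):
            lo = mid
        else:
            hi = mid
    return lo, calls

f = lambda x: math.cos(math.pi * x)
pts = [0.5 * i / 100000 for i in range(100001)]   # N = 100,001 points on [0, 0.5]
end, calls = binary_search_segment_end(f, pts, 0, 2.0 ** -17)
print(end, calls, 1 + math.log2(len(pts)))
```

Calling the function again with `begin` set to the returned index would produce the next segment, with a smaller npts each time.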


The number of calls to the user-programmed MATLAB function chebyRemz is used 
as a metric for two reasons: (1) the code for chebyRemz takes longer to execute than any 
other piece of code in the program, and (2) the number of calls to chebyRemz varies 
depending on which numeric function is being segmented. Appendix D shows a copy of 
profile results [7] that shows the execution time of each function. The goal is to 
minimize the number of calls to chebyRemz, thus speeding up the program. 

Appendix A.2.1, part b shows the portion of the program that applies this 
method. The file name is varQuadApproxBinSearch.m. 


Table 4 shows the number of calls to the function chebyRemez for 9 
different algorithms that were investigated to speed up the segmentation. The first 
column is the number of points used to subdivide the domain. The next 9 columns are 
the different algorithms and their results. Only one function and one accuracy were 
used: √(-ln(x)) and ε = 2^-17, respectively. 











N       Binary   Thirds   Ratio   1Est   2Est   3Est   1Avg   3Avg   Hybrid 
100K       764      640     699   6620    430    293    697    298       98 
10K        649      229     563    739    132    127    166    129      103 
1K         488      429     450    181    114    120    128    122      117 






































Table 4. Number of calls to the function chebyRemz for the various methods; 
segmentation of √(-ln(x)), ε = 2^-17 and various values of N. 


c. Divide by Thirds 

A second program was implemented that applied the same principle as 
binary search; however, instead of taking off half of the width, the program took off two 
thirds (i.e. it divides the remaining width by three). This method is also a two-step 
process: 
1. Locate: A point close to the optimum point is determined. 
2. Pinpoint: Use brute force to move up to the optimum point. 
Figure 12 shows the segmentation for the 5th segment. The domain 
interval is [a, b], and we start the segmentation of segment 5 at the end of the 4th 
segment. 

Step 1: The unsegmented part of the interval runs from the end of the 4th 
segment to b; its width is the initial width. A call to the function chebyRemez is used 
to generate a quadratic approximation over it. This approximation is tested to see if 
any points exceed the constraint ε. If the constraint is met, then we have the final 
segment. Exit. 


Step 2: Divide the initial width by three; the new value is 1/3 of the initial 
width. This is labeled as L1 in Figure 12. L1 is now the new proposed segment width, 
and chebyRemez is called to establish a quadratic approximation for the interval. The 
optimum segment endpoint is marked in Figure 12; L1 is clearly not the optimal width. 

Step 3: The program divides L1 by three and the result is L2. A quadratic 
approximation is computed to test the approximation error against the constraint. Since 
L2 is below the optimum point, we initialize a new variable, delta, to keep track of the 
width that is being added to or subtracted from the proposed width of the segment. 
delta is 1/3 of L2. 

Step 4: Increase L2 by delta, i.e. 1/3 of L2. This results in L3, which is tested to 
determine the approximation error. In Figure 12, L3 is still short of the optimum 
segment. 

Step 5: Increase L3 by the same delta, i.e. 1/3 of L2. The approximation 
is computed for the new proposed segment of width L4, and the approximation error is 
tested against the constraint. This time we have exceeded the optimum endpoint, i.e. the 
approximation error is greater than ε. In Figure 12, L4 is larger than the optimum point. 

Step 6: Since we have exceeded the optimum segment, we now reduce the 
variable delta to 1/3 of delta. This value is then used to reduce L4 to a narrower width, 
i.e. L5. In Figure 12, L5 is still wider than the optimum width. 


When the increment width is 2 or less, Locate is complete and the program 
goes to Pinpoint. The process stops when two adjacent points straddle the optimum 
segment endpoint. The lower of the two is the segment endpoint recorded by the program. 
Since the domain has been divided into discrete points, the recorded endpoint is just shy 
of the optimum point. The approximation error of the new segment meets the constraint; 
however, extending the segment to the next point to the right gives an approximation 
error that exceeds ε. 


The results showed an improvement over binary search. Table 4 shows 
that the method of Thirds called the function chebyRemez 764 times, as opposed to the 
binary search method, which took 924 calls to achieve the same segmentation. 

Other values besides one-third were tested, but they did not perform 
consistently better. Appendix A.2.1, part c shows the portion of the program that applies 
this method. The file name is varQuadApproxTHIRD.m. 
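Steps 1 through 6 and the Pinpoint phase can be sketched in Python as shown below. This is an interpretation of the description above, not the code in varQuadApproxTHIRD.m: widths are measured in point indices, `quad_max_error` is a hypothetical stand-in for chebyRemz (Chebyshev-node interpolation instead of Remez), a one-step segment is assumed to meet ε, and delta is cut to one third on every crossing of the optimum in either direction.

```python
import math

def quad_max_error(f, a, b, samples=200):
    """Stand-in for chebyRemz (assumption): max |f - p| on [a, b], where p is the
    quadratic interpolating f at three Chebyshev nodes (near-minimax, not Remez)."""
    mid, half = (a + b) / 2.0, (b - a) / 2.0
    xs = [mid + half * math.cos((2 * k + 1) * math.pi / 6) for k in range(3)]
    ys = [f(x) for x in xs]

    def p(x):
        total = 0.0
        for i in range(3):
            term = ys[i]
            for j in range(3):
                if j != i:
                    term *= (x - xs[j]) / (xs[i] - xs[j])
            total += term
        return total

    grid = [a + (b - a) * k / samples for k in range(samples + 1)]
    return max(abs(f(x) - p(x)) for x in grid)

def thirds_segment_end(f, pts, begin, eps):
    """Return (end index, calls) for the segment starting at pts[begin]."""
    calls, last = 0, len(pts) - 1

    def ok(end):
        nonlocal calls
        if end <= begin + 1:
            return True                  # assumed: a one-step segment meets eps
        calls += 1
        return quad_max_error(f, pts[begin], pts[end]) <= eps

    if ok(last):                         # Step 1: the remainder is the final segment
        return last, calls
    width = last - begin
    while True:                          # Steps 2-3: divide by three until below
        width = max(1, width // 3)
        if ok(begin + width):
            break
    delta, below = max(1, width // 3), True
    while delta > 2:                     # Steps 4-6: step by delta, shrink on crossing
        width = min(last - begin, width + delta) if below else max(1, width - delta)
        now_below = ok(begin + width)
        if now_below != below:
            delta, below = max(1, delta // 3), now_below
    if not below:                        # Pinpoint: step left until eps is met
        width -= 1
        while not ok(begin + width):
            width -= 1
    while begin + width < last and ok(begin + width + 1):
        width += 1                       # Pinpoint: step right while eps is met
    return begin + width, calls

f = lambda x: math.cos(math.pi * x)
pts = [0.5 * i / 100000 for i in range(100001)]
end, calls = thirds_segment_end(f, pts, 0, 2.0 ** -17)
print(end, calls)
```

The call counts from this sketch are not comparable with Table 4, which was measured with the Remez-based program on a different function.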


[Diagram: Divide Interval by Thirds. Shows the initial length; L1 = 1/3 of the initial 
length; L2 = 1/3 of L1, with delta initialized to L2/3 (get below the segment endpoint); 
and the widths L3, L4 and L5 obtained by adding or subtracting delta, with 
delta = delta/3, in the loop that converges on the segment endpoint.] 


Figure 12. Visual aid for description of divide by thirds algorithm. 


d. Increment by Ratio Numbers 
In this method, the width of the proposed segment is increased or 
decreased by multiplying the current proposed width by a series of fixed values. We 
have the same 2-step process of Locate and Pinpoint. 

In Locate, the proposed width is the entire remaining width of the domain 
interval [a, b], i.e. the width from point a to point b. The width is tested to see if the 
constraint has been exceeded or not; except for the last segment, the width will always 
exceed the optimum segment because the entire remainder of the interval is used at each 
iteration. As an example, consider that the first segment [a, x0] is already established 
(segment [a, x0] as shown in Figure 11). Next, the program needs to compute the second 
segment. The program will establish a proposed width [x0, b]. This is the entire 
remainder of the interval. The ratios are applied to the width [x0, b]. The result is 
a series of shorter widths that are tested until the constraint is met. This method is 
similar to the method "Divide by Thirds," except that a set of ratios is applied to the 
increment/decrement width. 

Table 4 shows that the implementation of increment by ratio numbers took 
1143 calls to the chebyRemz function. Appendix A.2.1, part d shows the portion of the 
program that applies this method. The file name is varQuadApproxRatio.m. 


e. Estimated Segment Widths (1, 2, 3, more and Average) 
Again, the 2-step process of Locate followed by Pinpoint is applied here. 
In Locate, an estimate of the segment width is calculated. 

Equation (0.3) is adapted from [15] to compute segment estimates for 
quadratic approximations. The derivation is in Appendix F. The accuracy ε and the 
third derivative of the function are used to estimate the width of the segments. The 
proposed segment widths are tested and the program falls back on the brute force method 
after the initial estimate. This yields a large improvement over using the brute force 
method alone. 


$$\mathrm{EstSegLen} = 4\left(\frac{3\varepsilon}{\left|\dfrac{d^{3}f}{dx^{3}}\right|_{\max}}\right)^{1/3} \qquad (0.3)$$
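As a plausibility check on the constant in equation (0.3), and not as a substitute for the derivation in Appendix F, the standard Chebyshev error bound can be applied under the assumption that the third derivative is roughly constant over a segment of width w. The symbol w is introduced here for illustration only.

```latex
\varepsilon \approx \frac{|f'''|}{3!\,2^{2}}\left(\frac{w}{2}\right)^{3}
            = \frac{|f'''|\,w^{3}}{192}
\quad\Longrightarrow\quad
w \approx \left(\frac{192\,\varepsilon}{|f'''|}\right)^{1/3}
        = 4\left(\frac{3\,\varepsilon}{|f'''|}\right)^{1/3}
```

The last step uses 192 = 64 × 3, so the factor 4 is the cube root of 64.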
One Estimate: In Table 4, when one estimate is used, the third 
derivative is computed at x = begin point of the segment. The estimated width is added to 
the begin point and the proposed segment is tested. The brute force method then takes 
over and single-steps to the optimum segment width. The result was 65,400 calls to 
chebyRemez. 


Two Estimates: The first estimated width is calculated using equation 
(0.3) and the third derivative is computed at the begin point of the segment. The resulting 
estimated width is added to the begin point and the resulting endpoint is used in equation 
(0.3) to make a second estimated width using the third derivative at the endpoint. The 
average of these two widths is the estimated width that is applied to the begin point to 
obtain a proposed endpoint. Again, the program uses the brute force method to complete 
the segmentation. This method improved the performance and took 3369 calls to 


chebyRemez. 


Three Estimates: Two estimates are computed as described above. The 
resulting width is divided in half, and the half-way point is used to compute the third 
estimate. The third estimate is averaged with the other two estimated widths to obtain 
the proposed segment width. As in the other two cases, the brute force method is then 
applied to complete the segmentation. Even further improvement was achieved: 1903 
calls to chebyRemez. 
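The one-, two- and three-estimate widths can be sketched in Python as follows. The function names are hypothetical, the third derivative `f3` is supplied by the caller and assumed to be nonzero at every point where it is evaluated, and the placement of the third evaluation point (half of the two-estimate width past the begin point) is one reading of the description above.

```python
def est_seg_len(f3, x, eps):
    """Equation (0.3) evaluated with the third derivative taken at the single point x."""
    return 4.0 * (3.0 * eps / abs(f3(x))) ** (1.0 / 3.0)

def one_estimate(f3, begin, eps):
    return est_seg_len(f3, begin, eps)

def two_estimates(f3, begin, eps):
    w1 = est_seg_len(f3, begin, eps)          # estimate at the begin point
    w2 = est_seg_len(f3, begin + w1, eps)     # estimate at the proposed endpoint
    return (w1 + w2) / 2.0

def three_estimates(f3, begin, eps):
    w1 = est_seg_len(f3, begin, eps)
    w2 = est_seg_len(f3, begin + w1, eps)
    half_way = begin + (w1 + w2) / 4.0        # half of the two-estimate width (assumed)
    w3 = est_seg_len(f3, half_way, eps)
    return (w1 + w2 + w3) / 3.0

# Illustrative check with f(x) = x**3, whose third derivative is the constant 6.
f3 = lambda x: 6.0
eps = 1e-6
w = one_estimate(f3, 0.0, eps)
print(w, 6.0 * w ** 3 / 192.0)   # the second value reproduces eps
```

For a constant third derivative all three variants return the same width; they differ only when the third derivative changes across the segment.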


Estimates with more than three widths were tested, but the performance 
began to degrade. So, an average was applied to the segments. 

Average of one estimate: In the average method, one estimate was 
computed from the begin point. The estimate was used to define a proposed segment. 
The entire set of points on this proposed width was evaluated using equation (0.3). Then, 
the mean of the resulting vector of estimated widths was computed and used as the 
proposed segment width. The result appeared to be similar (not identical) to taking two 
estimates (when multiple functions are tested, on average the results of two estimates and 
the average method are similar). Table 4 shows that this method called chebyRemez 5972 
times. 




Average of three estimates: This method builds on the three estimates 
described above. The three estimates define a proposed width, and all the points on the 
proposed width are evaluated with equation (0.3). This creates a vector of proposed 
estimates. The mean of this vector is then taken as the single estimate. The results of 
this method are similar to taking three estimates. However, since we evaluate all the 
points on the interval, it takes slightly longer. 


In [15], a comparison was made to show the benefit of three estimates 
over two estimates and one estimate in the case of linear approximation. While it is not 
discussed in [15], one estimate was computed in the linear approximation and the 
resulting proposed width was used to compute the mean of all the estimates obtained 
from evaluating all the points on the proposed width. The mean of the estimates was 
similar to taking the mean of just two estimates (begin point and proposed endpoint). In 
the quadratic case, the same method yielded results that were comparable to taking the 
mean of two estimates, just like the results in the linear case. However, when the mean 
of three estimates was used to define a proposed segment and the average of all the 
estimates on the newly proposed width was computed, the result was very close to taking 


the mean of just three estimates. 


Closer analysis revealed that, in many cases, the average of all the points 
worked well and sometimes even better than just the mean of three individual estimates. 
The results appear in Table 5. The first column is the suite of numeric functions, each 
represented by a number; the focus should be on the comparison, not on any particular 
numeric function. The second column is just the three estimates as described above, and 
the third column is the average of the estimates calculated using all points on the 
proposed segment. The fourth column is the difference between the second and third 
columns. The last column is a method described in part f, Hybrid of Thirds and 3 
Estimates. Table 5 shows that taking the average of all estimates on the segment has a 
slight advantage over taking the average of just three single estimates. Looking back, 
Table 4 used only one numeric function, and that made it appear that the method of 3Avg 
was slightly worse, whereas in Table 5 we can see that, when applied to the entire suite 
of functions, the average over the entire segment (which was selected after three 
estimates) was slightly better. The values at the bottom are the sum of all the calls to 
the approximating algorithm, chebyRemez, which was the metric used to determine the 
comparative speed of the program. 












































Function   3 Est   3 Avg   Difference   Hybrid 
1             23      29           -6       20 
2             93     103          -10       29 
3            148     146            2       14 
4            133     145          -12       23 
5             83      84           -1       26 
6             90      95           -5       23 
7            266      87          179       59 
8           6326    6210          116       61 
9            128      92           36       35 
10           293     298           -5       98 
11          6233    6203           30       65 
12           925     581          344      172 
13           230      81          149       39 
14          7378    7203          175       95 
15           650     963         -313      222 











Table 5. Comparison of "3 estimates"; the mean of all estimates computed on the proposed 
segment that was calculated after taking 3 estimates ("3 average"); and a hybrid 
that exaggerates the approximation error by 5%. All cases use N = 100,000 
and ε = 2^-17. 


The next question is: should we use just three estimates, or should we use 
the average of all the estimates computed from all the points on a proposed segment? 
The difference is small. The additional code that takes the average of an 
entire segment did not exceed the time taken by chebyRemez and did not significantly 
impact the computing time of the program. 

Because the additional code does not add significantly to the run time of the program 
and has advantages, we kept the program that averages the estimates over the entire 
segment. The analysis to support that decision follows. Consider the small section of a 
Profile report from MATLAB that is similar to the one in Appendix D. 


Table 6 shows the total time for varQuadApprox implemented with only 
three estimates. The time for the function, including all child functions, is 44.438 s. 
These values come from running the program with the function 
-(x log2 x + (1-x) log2(1-x)), N = 1,000,000 and ε = 2^-33. 





Profile Summary 
Generated 21-Aug-2007 22:25:40 

Function name               Calls    Total Time   Self Time 
multipleQuadApprox              1    44.906 s      0.156 s 
varQuadApproxHyb3EstThird    1460    44.438 s      3.516 s 
chebyRemz                   13187    39.156 s     16.406 s 
inline.subsref              87050    20.031 s      3.031 s 
inlineeval                  87202    17.031 s     17.031 s 
polyval                     69483     3.828 s      3.359 s 
twosComp                     5840     3.000 s      0.188 s 

Table 6. Profile Report for -(x log2 x + (1-x) log2(1-x)), N = 1,000,000 and ε = 2^-33. 
Shows 44.438 s for the varQuadApprox function that averages only three estimates. 


The same function and parameters were run with the additional code that 
takes the average of all estimates over the entire segment. The results appear in Table 7. 
The total time for varQuadApprox and all its child functions is 20.078 s. The additional 
code to compute the averages took 0.061 s, which translates to less than 1% of the time 
spent in varQuadApprox. Therefore, the cost of the additional code is negligible. This 
particular function clearly shows the advantage of taking the average: an improvement 
greater than 50% (44 s to 20 s). 


It should be noted that, in a few cases, the improvement was not as 
dramatic and, for √(-ln(x)), the average code performed worse by 20% (20 seconds to 25 
seconds). However, on average, it was better to take the average over the entire segment. 

A slightly different problem arises when the third derivative is zero. 
This presents a problem in the computation of estimates, because the third derivative is 
in the denominator of equation (0.3). One way to tackle the problem is to find 
the smallest non-zero third derivative magnitude over the entire domain interval [a, b] 
and use that to calculate the largest expected segment. This large segment is substituted 
whenever the third derivative is zero. In many cases, the resulting estimate is a poor 
estimate of the segment size and tends to slow down the program when encountered. 
Therefore, a hybrid of the best segmentation processes was used and is described below. 
varQuadApproxHyb3AvegThird (1462 calls, 20.078 sec) 

Parents (calling functions) 
Filename              File Type     Calls 
multipleQuadApprox    M-function    1462 

Lines where the most time was spent 
Line Number   Code                                 Calls   Total Time   % Time 
98            [p,oscil,errP] = ...                  1462    4.531 s     22.6% 
194           [p,oscil,errP] = chebyRemz(fct...      994    3.375 s     16.8% 
209           [p,oscil,errP] = ...                  1001    3.141 s     15.6% 
182           [p,oscil,errP] = chebyRemz(fct...      945    2.859 s     14.2% 
133           [p,oscil,errP] = ...                  1010    2.719 s     13.5% 
All other lines                                             3.453 s     17.2% 
Totals                                                     20.078 s      100% 

Time     Calls   Line   Code 
...      1461    ...    if ... > length(x_pts) 
0.561    1461    82     Der3Intr = f3der(x_pts(indx:indx+len)); % Get 
0.03     1461    83     AV3DER = mean(Der3Intr); % 
...      1461    84     x_range = 4*(epsilon*3/abs(AV3DER))^(1/3); % Get 
< 0.01   1461    85     len = round(x_range/(x_ptsRange)*length(x_pts)); 
< 0.01   1461    86     if len+indx > length(x_pts) 

Table 7. Profile Report for -(x log2 x + (1-x) log2(1-x)), N = 1,000,000 and ε = 2^-33. 
Shows 20.078 s for the varQuadApprox function and 0.061 s for the average of all the 
estimates on the entire segment. 


f. Hybrid of Thirds and 3 Estimates 
In this algorithm, we take advantage of the strengths of two programs. As 
with the other algorithms, we have a Locate and a Pinpoint step. However, Locate is a 
combination of Divide by Thirds and 3 Estimates. 
We know that ε is the constraint and that, when the approximation is 
good, the ratio of the maximum approximation error to ε should be very close to 1.0. 
This ratio can be used as a metric to determine the quality of our estimate. If the ratio 
is much larger than 1.0, the segment is too large and our estimate is too wide. If it 
is much less than 1.0, our estimate is too small. 

To take advantage of the ratio of approximation error to ε, the program 
first takes the average of the three estimates and, using the estimated width, computes 
the approximation error. If the ratio of the approximation error to ε is large (greater 
than 1.002) or small (less than 0.9), the program takes the estimated width as a starting 
width. The program then takes a small fraction of that width (5%) and stores it in a 
variable that is used to decrease or increase the proposed width. The algorithm used is 
Divide by Thirds. 
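The ratio test can be sketched in Python as follows. The thresholds 1.002 and 0.9 and the 5% fraction come from the text above; the function names, the returned labels and the treatment of ratios between the two thresholds as "close" are assumptions made for illustration.

```python
def classify_estimate(max_error, eps, upper=1.002, lower=0.9):
    """Compare the maximum approximation error of a proposed segment with eps."""
    ratio = max_error / eps
    if ratio > upper:
        return "too_wide", ratio        # error exceeds the constraint: shrink
    if ratio < lower:
        return "too_narrow", ratio      # plenty of slack: grow
    return "close", ratio               # estimate is near the optimum (assumed label)

def initial_step(estimated_width, fraction=0.05):
    """Initial increment/decrement handed to the Divide by Thirds refinement."""
    return fraction * estimated_width

eps = 2.0 ** -17
for err in (1.5 * eps, 0.5 * eps, 0.95 * eps):
    print(classify_estimate(err, eps))
print(initial_step(0.02))
```

Starting the Divide by Thirds refinement from a 5% step rather than from the whole remaining interval is what lets the estimate save calls.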


In addition to the steps taken above, the program was modified to 
exaggerate the error calculated from the approximation. This only happens in the final 
steps when trying to Pinpoint the end of the segment. This has two effects: 

(1) It drastically reduces the number of steps required. Many of 
the estimates fall short, and exaggerating the error when the segment falls short 
reduces the distance that Pinpoint has to travel to exceed ε. Saving two or three steps 
per segment adds up to about 100 steps if the segmentation produces 33 segments. 

(2) Exaggerating the approximation error has the effect of making some 
of the segments slightly smaller than they would be if the approximation error 
were not adjusted. However, remember that the final segment is usually truncated and 
therefore can absorb the extra space created by making the previous segments narrower. 
In a way, decreasing the size of each segment by a small amount builds in a 
little slack per segment, because the approximation error is slightly smaller than ε. The 
truncated segment is not optimized and can be increased to accommodate the small 
adjustments in all the other segments. Only in very high precision segmentations does 
the number of segments increase noticeably. The increase is on the order of single digits 
when considering hundreds or thousands of segments. This compromise is acceptable 
because it dramatically reduces the number of calls to chebyRemez, as shown in the last 
columns of Table 4 and Table 5. Further, it does not increase the number of segments by 
any significant amount. 


This hybrid method produces by far the best solution among all the 
algorithms discussed. Consider the function √(-ln(x)): as shown in Table 4, only 98 calls 
to chebyRemez were needed to achieve segmentation, which is 0.0098% of the steps that 
brute force would take when N = 1,000,000. 


C. MATLAB RESULTS 

MATLAB was used to segment the numeric functions into piecewise quadratic 
segments. The uniform and non-uniform segmentation, number of segments required for 
each of the numeric functions and a comparison of the segmentation algorithms have 


been discussed in part B above. 


The coefficients that represent the piecewise quadratic approximation for the 
segments are computed and stored in a file. These files can store the coefficients and 
segment boundaries in hexadecimal, binary or decimal form. The NFG implemented in 
the floating point number representation uses the coefficients saved as decimal values. 
However, when the NFG uses the fixed point number system, the coefficients are saved as 
hexadecimal values. 

Table 8 shows the data in the memory file for the non-uniform segmentation of 
cos(πx). At the top of each memory file is a decimal number that states the number of 
segments in the memory file. This is useful when reading the file to determine how many 
elements need to be read into the program. 
Decimal memory file: 
460 
Segment endpoint    c2                  c1                  c0 
0.004610004610      -4.934645942292     -0.000000373180     1.000000000116 
0.007928007928      -4.933831217369     -0.000007964422     1.000000018030 
0.010830510831      -4.932649394425     -0.000026748228     1.000000092899 
0.013492513493      -4.931191898804     -0.000058351444     1.000000264447 
0.015989015989      -4.929503741104     -0.000103932118     1.000000572352 

Hexadecimal memory file: 
460 
Segment endpoint      c2                    c1                    c0 
0x000000970f858467    0xfffd885d8592426b    0xfffffffffcde9a16    0x0000800000003ff4 
0x00000103c8f362f9    0xfffd887837fab57d    0xffffffffbd3088d1    0x000080000026b814 
0x00000162e4e8e873    0xfffd889ef1d427ca    0xffffffff1f9e9e52    0x0000800000c77ff1 
0x000001ba1f681879    0xfffd88ceb4302ae2    0xfffffffe16833aaf    0x000080000237e533 
0x0000020bed96624e    0xfffd8906057b39b1    0xfffffffc982779e8    0x0000800004cd1dc1 

Table 8. Sample memory files (Decimal and Hexadecimal). Non-uniform segmentation 
of cos(πx), N = 1,000,000 and ε = 2^-33. 


The first column shows the segment end points. The next three columns are the 
coefficients of the quadratic polynomial that determines values in the segment. The order 
is c2, c1 and c0 from left to right. Equation (0.4) shows the relationship of the 
coefficients to the polynomial. 

$$f(x) \approx p(x) = c_2 x^2 + c_1 x + c_0 \qquad (0.4)$$

The hexadecimal values in Table 8 use a fixed point number system, where the 
first 17 bits are the integer part, including a sign bit, and the last 47 bits are the 
fraction. The number is a two's complement number. The number system is discussed in 
section III. 
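The fixed point layout can be checked against Table 8 with a short Python sketch. The function names are hypothetical, and the layout (64-bit two's complement word with 47 fraction bits) is taken from the description above.

```python
FRACTION_BITS = 47
WORD_BITS = 64

def from_fixed(word, frac_bits=FRACTION_BITS, word_bits=WORD_BITS):
    """Interpret an unsigned 64-bit word as a two's complement fixed point value."""
    if word >= 1 << (word_bits - 1):      # sign bit set: value is negative
        word -= 1 << word_bits
    return word / (1 << frac_bits)

def to_fixed(value, frac_bits=FRACTION_BITS, word_bits=WORD_BITS):
    """Encode a real value as a 64-bit two's complement fixed point word."""
    return round(value * (1 << frac_bits)) & ((1 << word_bits) - 1)

# Hexadecimal words copied from the first two rows of Table 8.
print(from_fixed(0xfffd885d8592426b))   # c2 of segment 1, close to -4.934645942292
print(from_fixed(0x000080000026b814))   # c0 of segment 2, close to 1.000000018030
print(hex(to_fixed(1.0)))               # 0x800000000000
```

Decoding the hexadecimal words reproduces the decimal columns of Table 8 to the precision printed there.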


D. SUMMARY 


MATLAB is used to segment the suite of functions in Table 1. The segmentation 
algorithm results in the fewest segments for a given accuracy constraint. In each segment 
the minimax quadratic approximation is achieved by computing the coefficients using the 
Remez algorithm, which gives a better approximation than MATLAB's built-in 
function, Polyfit. 


The Remez algorithm is slow; therefore various methods were investigated to find 
an efficient algorithm to compute the segmentation of the numeric functions. A hybrid of 
three algorithms is chosen as the best algorithm to compute fast segmentation of the suite 
of functions. Table 4 uses only one function, but summarizes the results of the 


comparisons. 


Quadratic segmentation at high accuracy (2~°) results in over 96% fewer 


segments, compared to linear approximation as shown in Table 2. 


The segmentation is the first step to building the NFG. Next the circuit has to be 
designed in hardware. In section III, we look at the components that make up the NFG 


circuit. 
III. NFG CIRCUIT 


A. CIRCUIT OVERVIEW 
Figure 1 is duplicated here from section I for convenience. Figure 1 shows three 
multipliers, the segment index encoder, the coefficients table and one 3-input adder. 
These are the hardware components for the NFG. 


[Block diagram: input X; Multiplier (X·X); Segment Encoder; Coefficients Table; 
Multiplier (c2·X²); Multiplier (c1·X); 3-input adder producing 
f(X) = c2·X² + c1·X + c0.] 


Figure 1. NFG Overview (duplicated from Section I). 
The architecture has three 64-bit multipliers and one 3-input 64-bit adder. The 
adder and multiplier can be implemented in two's complement or floating point by using 
the prescribed math operators. To generate a floating point multiplier or adder, the 
operands need to be declared as doubles or floats. To generate a two's complement 
multiplier or adder, each operand needs to be declared as an integer, e.g. int64_t or int. 

The segment index encoder is designed using a priority selector macro supplied 
by SRC and provided as a user-callable macro. In uniform segmentation, the desired 
index can be obtained by multiplying the input by a segment density number. 
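The data flow of the NFG can be modeled in software with the Python sketch below. It is a behavioral illustration only, not the SRC-6 implementation: the function names are hypothetical, a binary search over the endpoints stands in for the priority selector macro, and only the first five rows of the decimal memory file in Table 8 are used, so inputs must lie in [0, 0.015989015989].

```python
import bisect
import math

# First five rows of the decimal memory file in Table 8 (cos(pi*x), non-uniform).
ENDPOINTS = [0.004610004610, 0.007928007928, 0.010830510831,
             0.013492513493, 0.015989015989]
C2 = [-4.934645942292, -4.933831217369, -4.932649394425,
      -4.931191898804, -4.929503741104]
C1 = [-0.000000373180, -0.000007964422, -0.000026748228,
      -0.000058351444, -0.000103932118]
C0 = [1.000000000116, 1.000000018030, 1.000000092899,
      1.000000264447, 1.000000572352]

def segment_index(x):
    """Segment index encoder: index of the first segment endpoint that is >= x."""
    i = bisect.bisect_left(ENDPOINTS, x)
    if i == len(ENDPOINTS):
        raise ValueError("x lies beyond the five segments copied here")
    return i

def uniform_segment_index(x, density):
    """Uniform segmentation: multiply by the segment density and truncate."""
    return int(x * density)

def nfg(x):
    i = segment_index(x)               # segment encoder + coefficients table lookup
    x_squared = x * x                  # first multiplier
    return C2[i] * x_squared + C1[i] * x + C0[i]   # two multipliers, 3-input adder

for x in (0.0, 0.003, 0.015):
    print(x, nfg(x), math.cos(math.pi * x))
```

The three printed pairs agree to within about 1.2e-10, consistent with the accuracy of the segmentation in Table 8.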


1. Number System 
To determine the number system to use, we need to know the range of values the 
NFG will have to handle. An analysis of the domain, range and coefficients provides the 
boundaries for the number system. 

Table 9 shows the analysis of the numeric functions. The numeric functions have 
been ordered from the most demanding to the least demanding. At the top, √(-ln(x)) 
requires 15 bits to accommodate any integer value the hardware may encounter, based on 
the range of values and coefficients. 


The columns Max and Min are the maximum and minimum values among all
coefficient values and all possible domain and range values, i.e. any number that would
appear in the computation done by the NFG. The column labeled log2 (abs(largest one))
is obtained by comparing the absolute values of Max and Min, choosing the larger, and
computing the logarithm base 2 of this value. The final column shows the
maximum number of bits required to represent the largest possible integer the NFG may
encounter. Note that these values have been computed for a specific domain, and
different domains may require more or fewer bits. Table 2 shows the domains for each of
the numeric functions that appear in Table 9.
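The last two columns of Table 9 can be reproduced with a short routine. The following is a minimal sketch, assuming the bit count is the base 2 logarithm of the larger magnitude rounded up to a whole number; the function name integer_bits is hypothetical, and a sign bit is not counted.

```c
#include <assert.h>

/* Number of integer bits needed for the larger magnitude of max_val and
   min_val, i.e. log2(largest) rounded up. The count is found by doubling
   a power of two until it reaches the magnitude, so no math library is
   needed. Magnitudes of 1 or less need 0 integer bits. */
static int integer_bits(double max_val, double min_val)
{
    double a = max_val < 0.0 ? -max_val : max_val;
    double b = min_val < 0.0 ? -min_val : min_val;
    double largest = a > b ? a : b;
    double power = 1.0;
    int bits = 0;

    assert(largest < 1.0e300);  /* keep the doubling loop away from overflow */
    while (power < largest) {
        power *= 2.0;
        bits++;
    }
    return bits;
}
```

For the first row of Table 9, integer_bits(24047.26212, -196.4301496) gives 15, which matches the tabulated value.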


The NFG requires at least 15 bits to represent the largest integer that may be
encountered when computing the approximation of a numeric function. Therefore, the
number system chosen is 16 bit integer and 16 bit fraction (i.e. 32 bit implementation). A
64 bit implementation has 32 bit integer and 32 bit fraction. In a 64 bit two’s
complement number, the decimal point is interpreted to be between bit 32 and bit 31,
where the LSB is bit 0.
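As an illustration of the 64 bit number system, the sketch below converts between a double and a 64 bit two’s complement value with 32 integer bits and 32 fraction bits. The names to_q32_32, from_q32_32 and Q32_ONE are hypothetical and are not taken from the NFG source code.

```c
#include <assert.h>
#include <stdint.h>

#define Q32_ONE 4294967296.0  /* 2^32, the weight of bit 32 (first integer bit) */

/* Convert a double to 32.32 fixed point; the cast truncates toward zero. */
static int64_t to_q32_32(double value)
{
    /* the value must fit in 32 integer bits, including the sign */
    assert(value > -2147483648.0 && value < 2147483648.0);
    return (int64_t)(value * Q32_ONE);
}

/* Convert a 32.32 fixed point value back to a double. */
static double from_q32_32(int64_t fixed)
{
    return (double)fixed / Q32_ONE;
}
```

For example, 1.5 becomes 0x0000000180000000: the integer 1 sits in bit 32 and the fraction 0.5 sits in bit 31.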

The 64 bit implementation benefits from using a 16 bit integer and 48 bit fraction;
however, the number of segments required is very large, and these implementations were
not investigated in detail. As an example, cos(πx) at ε = 2” and N=5,000,000 would
require 19,167 segments.




































































Function                        | Max         | Min          | log2 (abs(largest one)) | Bits
√(−ln(x))                       | 24047.26212 | -196.4301496 | 14.55358503             | 15
−(x log2 x + (1−x) log2(1−x))   | 360.5900787 | -185.0149295 | 8.494215892             | 9
tan²(πx) + 1                    | 78.89563478 | -26.88144904 | 6.301873574             | 7
sin(e^x)                        | 94.22597144 | -96.6450472  | 6.594623895             | 7
tan(πx)                         | 19.70724959 | -3.570442576 | 4.300654538             | 5
ln(x)                           | 4.934751084 | -4.934751014 | 2.302977315             | 3
sin(πx)                         | 1.569925541 | -4.934645908 | 2.302946566             | 3
cos(πx)                         | 1.569925541 | -4.934645908 | 2.302946566             | 3
1/x                             | 2.997676487 | -2.995354324 | 1.583844694             | 2
log2(x)                         | 2.882537585 | -2.162615784 | 1.527339419             | 2
2^x                             | 1.093679242 | 0.004061004  | 0.129189682             | 1
(function illegible)            | 2           | -0.124634328 | 1                       | 1
(function illegible)            | 2           | -1.247861112 | 1                       | 1
(function illegible)            | 1.414213562 | -0.414997832 | 0.5                     | 1
1/(1+e^…) (exponent illegible)  | 1           | -0.045379009 | 0                       | 0

Table 9. Maximum and minimum values encountered for each function in the NFG
computation. Last column is the number of bits required for the integer portion.
2. 16, 32, 64 Bit Accuracy vs. 16, 32, 64 Bit Architecture

The accuracy and architecture can be built to match each other. Consider a set of 
values of 16 bit accuracy. Based on the number system, we would need 16 bits for 
integer and 16 bits for fraction (which is the accuracy). An architecture that matches 
these needs has to have 32 bit words; the architecture would be 32 bits. One 
implementation in the NFG was designed this way. Another design was built with 32 bit 
accuracy (32 bits fraction and 32 bits integer) and therefore the width of the architecture 


is 64 bits. 


Another way to build the NFG is to use 64 bit architecture for all accuracies. This 
means that all values will be represented in 64 bits. Consider a value that is accurate to 
16 bits. In this case, 32 bits are available to represent the fraction, but the fraction will 
only be accurate to 16 bits. The rest of the bits are irrelevant, but the hardware operates 


on all 64 bits. The architecture, in this case, does not match the accuracy. 


B. CIRCUIT COMPONENTS 

1. Segment Index Encoder

The segment index encoder accepts an input value x (within the domain of the
NFG) and outputs a number used to obtain the quadratic coefficients. The
number is the index of the segment to which x belongs. The encoder is used only in
non-uniform segmentation.


User callable macros available in the SRC are used to implement a priority
selector in the NFG. The prioritized selectors work as an “if-else-if” sequence. Options
are available for 8, 16, 32 and 64 bit wide values, and each of these bit widths can be
implemented with 4, 8, 16, 32, 64, 128 or 256 elements. For example, choosing 64 bits
and 256 elements is equivalent to a priority encoder of 256 64 bit words.


The prioritized selector requires a Boolean condition and an assignment for a true
condition. In the NFG, the Boolean condition is the comparison of the input value (the
numeric function argument, x) to the segment endpoint. If x does not exceed the segment
endpoint, then x belongs to that segment and the corresponding assignment value is the
index of the segment. That index is then used to access the polynomial coefficients that
approximate the numeric function in that segment.
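The behavior of the prioritized selector can be modeled in ordinary C. This is a minimal software sketch of the if-else-if behavior, not the SRC macro itself; the function name segment_index and the array layout are assumptions made for illustration.

```c
#include <assert.h>

/* Software model of a prioritized selector. Conditions are tested in
   order and the first true condition decides the index. endpoints[i] is
   the upper endpoint of segment i, in increasing order. If no condition
   is true, the last segment is returned as the default value. */
static int segment_index(double x, const double *endpoints, int num_segments)
{
    int i;

    assert(num_segments > 0);
    for (i = 0; i < num_segments - 1; i++) {
        if (x <= endpoints[i])
            return i;
    }
    return num_segments - 1;
}
```

With endpoints {0.1, 0.2, 0.5}, an input of 0.15 fails the first comparison, passes the second, and so receives index 1.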


The types of selectors for a given segmentation are carefully chosen so as not to
use more FPGA area than necessary. For example, consider a numeric function that has
been segmented into 48 segments. The smallest single selector that accommodates this
number of segments is the 64 element selector. The 64 element selector can
handle another 16 elements, but since we do not need them, those 16 elements are
wasted. A better approach is to combine two smaller selectors: one 32 element selector
and one 16 element selector. This saves FPGA area and allows us to
build the selector we need. An example of the described code is provided in Table 10.








//--Select Which Switch Statement will be executed--//
if ( varx <= 0.333333333333333310)
    sel = 1;
else if ( varx <= 0.500000000000000000)
    sel = 2;

//--Switch Statement--//
switch (sel)
{
case 1:
    select_pri_64bit_32val( varx <= 0.010351035103510351, 0,
                            varx <= 0.020802080208020803, 1,
                            varx <= 0.031203120312031204, 2,
                            ...
                            varx <= 0.322882288228822870, 30,
                            31, &indx);
    break;
case 2:
    select_pri_64bit_16val( varx <= 0.343734373437343750, 32,
                            varx <= 0.354135413541354140, 33,
                            varx <= 0.364586458645864590, 34,
                            ...
                            varx <= 0.479147914791479170, 45,
                            varx <= 0.489598959895989620, 46,
                            47, &indx);
    break;
}

Table 10. Code that uses two selectors to implement 48 segments.
To implement a selector with more than 256 elements, a combination of the available
selectors can be used. In the .mc file, an if-else-if statement precedes the set of
selectors and selects which one of the selectors will be used to encode the index.

More detail on the various selectors available in the SRC is provided in Appendix
A.10 of [17].


2. Indexing in Uniform Segmentation 

In uniform segmentation, the segment is computed directly from the input value, x,
by multiplying it by a segment density number, which represents the number of segments
per unit length. No segment index encoder is needed: x is multiplied by the segment
density number and the integer part of the result is the index that is applied to the
coefficients’ arrays to access the coefficients for the quadratic approximation.

The segment density number is obtained by dividing the length of the entire interval
by the number of segments and inverting the result.

For example, consider an interval, [0, 0.5] with uniform segmentation. If 100
segments are realized, then the number used to multiply all inputs is
1/(0.5/100) = 200. If the input is 0.3356, then the coefficients will be extracted from
the OBM array using the index 67 (floor(0.3356 × 200) = floor(67.12) = 67).


If the interval of the domain starts at a non-zero value, then the index obtained
from the above method will be offset. Simply subtract the offset from the index obtained
to get the true index into the array. This extra step increases the pipeline depth of the
NFG, and the effect is greater in the floating point implementation than in the fixed
point implementation.
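The index computation for uniform segmentation, including the offset, can be sketched in a few lines of C. The function names are hypothetical, and the sketch assumes the offset is the index that the begin point of the interval would receive, so that the first segment maps to index 0.

```c
#include <assert.h>

/* Segments per unit length. Algebraically this equals inverting
   (interval length / number of segments). */
static double segment_density(double begin, double end, int num_segments)
{
    assert(end > begin && num_segments > 0);
    return num_segments / (end - begin);
}

/* Index into the coefficient arrays for input x. The cast truncates,
   which equals floor() because x is assumed to be non-negative. */
static int uniform_index(double x, double density, int offset)
{
    assert(x >= 0.0 && density > 0.0);
    return (int)(x * density) - offset;
}
```

For the interval [0, 0.5] with 100 segments the density is 200, and the input 0.3356 with an offset of 0 gives index 67, as in the example above. For a hypothetical interval [1, 2] with 100 segments, the density is 100, the offset is 100, and the input 1.25 gives index 25.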
a. Floating Point Implementation
The uniform segmentation of the NFG in floating point requires three
files: main.c, <subroutine>.mc and memoryFile. An array containing the floating point
values of the endpoints and coefficients of the uniform segmentation is passed into the
OBM via a DMA call. The sample points for testing the NFG are placed in a separate
array and passed into OBM via a second DMA call. The memory file contains three
numbers at the beginning of the file:
• The number of segments (which is also the number of sets of
coefficients in the memory file). Stored as an int.

• The segment density number that is used to determine the
segment to which any x input belongs. Stored as a double.

• The offset value (needed for functions that have an interval
with a non-zero begin point).


b. Fixed Point Implementation
The uniform segmentation, fixed point implementation works similarly to
the floating point implementation. Three files are needed: main.c, <subroutine>.mc and
memoryFile. The coefficients in the memory file and in the computation are two’s
complement hexadecimal numbers, as described in the section on number systems. The
memory file contains three numbers at the beginning of the file:
• The number of segments (which is also the number of sets of
coefficients in the memory file). Stored as an int.

• The segment density number that is used to determine the
segment to which any x input belongs. Stored as an int64_t.

• The offset value (needed for functions that have an interval
with a non-zero begin point).


The computation of the index, and therefore the segment, is accomplished
in two’s complement. One major problem exists in this multiplication: the product is 128
bits, but the architecture only allows 64 bits to be stored. This means the upper 64 bits
are truncated. In addition, since the decimal point in the operands is 32 bits from the
LSB, the decimal point in the product is between bit 63 and bit 64 (when the LSB is
considered to be bit 0). This means we lose all integer values, and the product that
is stored is only the fraction portion of the true 128 bit product.


To represent the full range of numbers in the numeric functions, we need
to retrieve some of the upper bits. The segment density number is normally a whole
number (without value in the fraction); occasionally the segment density number may
have a small but negligible fraction. We can therefore perform a 16 bit logical shift
right on the segment density number without a large loss. The product then carries 48
fraction bits instead of 64, which opens up 16 bits for the integer part of
the product; this integer part is the index into the array of coefficients. 16 bits is
enough to represent over 65,000 segments. The product is then shifted 48 bits to the
right to give an index number (index numbers must be whole numbers). This method is
prone to rounding errors, which occasionally result in the wrong index.
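The shift scheme described above can be sketched in C as follows. This is an illustration under stated assumptions rather than the NFG source: the function name uniform_index_q32 is hypothetical, both operands are assumed to be non-negative values with 32 integer bits and 32 fraction bits, and the 64 bit multiply keeps only the low 64 bits, as the text describes.

```c
#include <assert.h>
#include <stdint.h>

/* x_q and density_q are 32.32 fixed point values. Shifting the density
   right by 16 leaves it with 16 fraction bits, so the low 64 bits of the
   product hold 16 integer bits above 48 fraction bits. Shifting the
   product right by 48 leaves the whole-number index. */
static uint32_t uniform_index_q32(int64_t x_q, int64_t density_q)
{
    uint64_t density_16;  /* density with 16 fraction bits */
    uint64_t product;     /* low 64 bits of the product */

    assert(x_q >= 0 && density_q >= 0);
    density_16 = (uint64_t)density_q >> 16;
    product = (uint64_t)x_q * density_16;
    return (uint32_t)(product >> 48);
}
```

With x = 0.3356 and a density of 200, both converted to 32.32 fixed point, the function returns 67, in agreement with the floating point example. Any fraction bits of the density that fall in the 16 discarded positions are lost, which is one source of the occasional wrong index noted above.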


Other schemes have to be implemented when both operands have a 
significant amount of data in the fraction. The section on the two’s complement 


multiplier discusses other schemes in more detail. 


3. Coefficients Table 

The coefficients to the quadratic equation for each segment are stored in an array 
in the OBM banks on the MAP® board. The segment index encoder provides an index 
into the array. The coefficients are accessed and applied to the quadratic equation along 


with the x value that is being evaluated. 
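The data path of Figure 1 can be summarized in a few lines of C. The sketch below is a floating point model, assuming the coefficients of each segment are stored as a triple named c2, c1 and c0 (these names and the structure layout are hypothetical); in the NFG the coefficients reside in OBM arrays.

```c
#include <assert.h>

/* One set of quadratic coefficients per segment (hypothetical layout). */
typedef struct {
    double c2;  /* coefficient of x squared */
    double c1;  /* coefficient of x */
    double c0;  /* constant term */
} quad_coeff;

/* Evaluate the quadratic for the segment selected by index, using three
   multiplications and one 3-input addition as in Figure 1. */
static double nfg_eval(const quad_coeff *table, int index, double x)
{
    double xx, term2, term1;

    assert(index >= 0);
    xx = x * x;                     /* multiplier 1 */
    term2 = table[index].c2 * xx;   /* multiplier 2 */
    term1 = table[index].c1 * x;    /* multiplier 3 */
    return term2 + term1 + table[index].c0;  /* 3-input adder */
}
```

Because the three products do not depend on each other's results except through x × x, the computation maps naturally onto a pipeline that accepts one input per clock.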


4. Multiplier 
The three multipliers shown in Figure 1 are either implemented in two’s 
complement or floating point. Floating point operations increase the pipeline depth, but 


are easier to code. 


2 The largest number of segments is 34,483, which is the uniform segmentation of √(−ln(x)),
when ε = 2”. Table 12 shows the number of segments for various functions when using uniform
segmentation.
a. Floating Point Multiplier
The floating point multipliers implemented in the NFG are implicitly
instantiated. The operands are declared as doubles, and when the multiplier operator in
the .mc file is applied, the MAP® compiler builds the floating point multiplier.


b. Two’s Complement Fixed Point Multiplier

The three main categories of interest are:

• Fixed point two’s complement multiplier
• Floating point multiplier
• Signed magnitude multiplier

The signed magnitude multiplier was not built. The fixed point multipliers
implemented in the NFG are either implicitly instantiated or explicitly built in HDL. The
two’s complement fixed point multiplier was built in Verilog, built in VHDL, and
implicitly instantiated by the SRC MAP®, with various levels of success.


To implicitly instantiate the two’s complement multiplier, the operands are
declared as integer values (int64_t), and when the multiplier operator in the
<subroutine>.mc file is applied, the MAP® compiler builds the appropriate multiplier.

This method has two major problems: (1) The SRC 64 bit × 64 bit multiplier
does not produce a 128 bit product. Instead, it produces a 64 bit product that is
composed of only the lower 64 bits. (2) If the MSB at the cutoff is a binary 1, the
number appears as a negative number, even though it is really a positive number.

Because of the number system chosen, i.e. 32 bits of integer and 32 bits of
fraction, multiplication results in a product that represents only the fraction portion of
the multiply; the integer portion, bits 65 through 128 of the product (counting the LSB
as bit 1), is truncated.
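The effect of keeping only the lower 64 bits can be seen in a small C model. The sketch below is not the SRC multiplier or the HDL design; it assumes non-negative operands with 32 integer bits and 32 fraction bits, and the function names are hypothetical. The second function forms the four 32 bit partial products and keeps bits 32 through 95 of the 128 bit product (LSB as bit 0), which is the result in the same number system as the operands.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 64 x 64 multiply that stores only the lower 64 bits. */
static uint64_t mul_low64(uint64_t a, uint64_t b)
{
    return a * b;  /* unsigned arithmetic wraps modulo 2^64 */
}

/* 32.32 fixed point product built from four 32 bit partial products.
   Integer overflow beyond 32 bits wraps and is not detected. */
static uint64_t mul_q32_32(uint64_t a, uint64_t b)
{
    uint64_t a_hi = a >> 32, a_lo = a & 0xFFFFFFFFu;
    uint64_t b_hi = b >> 32, b_lo = b & 0xFFFFFFFFu;
    uint64_t lo_lo = a_lo * b_lo;  /* weight 2^0  */
    uint64_t hi_lo = a_hi * b_lo;  /* weight 2^32 */
    uint64_t lo_hi = a_lo * b_hi;  /* weight 2^32 */
    uint64_t hi_hi = a_hi * b_hi;  /* weight 2^64 */

    assert(a_hi < 0x80000000u && b_hi < 0x80000000u);  /* non-negative inputs */
    return (hi_hi << 32) + hi_lo + lo_hi + (lo_lo >> 32);
}
```

For 1.5 × 2.0 (0x0000000180000000 and 0x0000000200000000), mul_low64 returns 0 because the whole result lies above bit 63, while mul_q32_32 returns 0x0000000300000000, i.e. 3.0.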


One way to overcome this limitation is to choose a different number
system that has fewer bits to represent the fraction, but this reduces the accuracy of the
NFG and it still limits the size of the integer. The integer must be at least 16 bits to
provide full coverage of the values encountered in the suite of functions in Table 1. One
implementation of the NFG was built by shifting the operands right 8 bits before the
multiply. This allowed for 16 bits to be represented in the integer portion of the product.
In this case, the best accuracy that one would expect to attain is 24 bits, i.e. 2^-24.
Because the operands are truncated, error is propagated to the output and the accuracy
is not reliable. Shifting values presents another problem: if the MSB is a binary 1, then
the right shift operation sign extends the number. This has unwanted effects. A product
may be positive, but if the bit right before the cutoff point is a binary 1, the shifted
values will be sign extended and we have to zero out the leading bits. More detail on
the results of this method can be found in section V, where the implementation results
are covered.


The best solution is to build an HDL multiplier that can compute the result
in the number system chosen and therefore keep the desired accuracy and the best range
for the integer without any sacrifices to accuracy. The problem with this method is that
it requires a long carry chain.


Verilog or VHDL can be used to explicitly build the multipliers. Several
multipliers were built in VHDL and Verilog. The HDL files do not meet the timing
requirements while running the NFG, although the program compiles without any errors.
Simulation using ModelSim and Xilinx ISE showed that the design for the multipliers
was correct. The problem appears to be the carry chain that is required to add all the
partial products.

Further investigation is needed to determine whether the problem is indeed in the
carry chain and whether a carry save adder (CSA) followed by a carry lookahead adder
(CLAH) is required. Neither adder was built.


5. Adder
The NFG requires a 3-input adder. As in the case with the multipliers, floating
point and fixed point adders are instantiated by the MAP® compiler.
C. SUMMARY

The NFG circuit requires three multipliers and one 3-input adder. Floating point 
implementation is easier than the fixed point implementation, but requires more 
hardware. The multipliers can be instantiated implicitly or in the case of fixed point, the 


user has the option to explicitly build the multiplier in HDL. 


Fixed point arithmetic presents some challenges with rounding and truncating of 


the operands and results. 


The circuit design was built on the SRC-6E reconfigurable hardware. Section IV 
provides a background on the SRC-6E system to give a better understanding of the 


hardware and software system. 
IV. SRC BACKGROUND


A. INTRODUCTION 
The late Seymour Cray established SRC Computers Incorporated in Colorado 
Springs, Colorado in 1996. SRC developed the IMPLICIT+EXPLICIT™ architecture 


that is designed to provide increased performance over conventional processors [16]. 


1. IMPLICIT+EXPLICIT™ Architecture 

The IMPLICIT+EXPLICIT™ architecture allows the full integration of Dense 
Logic Device (DLD) technology such as ASIC devices or microprocessors with 
reconfigurable Direct Execution Logic (DEL). SRC’s Carte™ Programming 
Environment lets the programmer choose that part of code that executes in the fixed logic 
(i.e. microprocessor - implicit) and that part that executes in the reconfigurable hardware 


(explicit) [16]. Figure 13 is an overview of the SRC IMPLICIT+EXPLICIT™ 





architecture. 
[Figure 13 diagram: the Carte™ Programming Environment accepts Fortran and C and
produces a unified executable for two kinds of devices. Implicitly controlled device:
dense logic device, higher clock rates, typically fixed logic (µP, DSP, ASIC, etc.).
Explicitly controlled device: direct execution logic, lower clock rates, typically
reconfigurable (FPGA, CPLD, OPLD, etc.). The diagram also shows memory and memory
control blocks.]

Figure 13. IMPLICIT+EXPLICIT™ architecture [16].
The user can program in the Carte™ Programming Environment in C or
FORTRAN instead of designing logic. A single executable is generated that specifies
which operations execute on which parts of the system. A programmer who wishes to
design the logic can do so in a schematic capture program and generate VHDL
or Verilog files that are used as macros. The user can also code the Verilog and VHDL
files directly and use them as macros. More information on what is needed to implement
macros is provided in the section on software code [16].


B. HARDWARE 
Figure 14 shows 3 Xilinx XC2V6000 FPGAs on the MAP®, 2 sets of memory and 
some ROM. 


[Figure 14 diagram: the MAP® board with a controller FPGA, configuration ROM, six banks
of On-Board Memory (24 MB) and two user logic FPGAs (User Logic 1 and User Logic 2).
Labels visible in the diagram: two links at 1400 MB/s sustained payload, 4800 MB/s
(6 x 64b) paths to the On-Board Memory, a 192b path at 4800 MB/s between the user
logic FPGAs, and GPIO at 2400 MB/s each.]

Figure 14. MAP® Hardware overview diagram [18].
There are three FPGAs. The user can program two of the FPGAs, while the third
is used as a controller. The FPGAs are Xilinx Virtex-II XC2V6000 devices with a -4 speed
grade. There are 6 banks of dual ported On-board Memory (OBM) with a total of 24 MB
(high-speed local memory). The OBM is connected to the two user logic FPGAs
via a 4800 MB/s bus (the OBM is also connected to the third FPGA via another
4800 MB/s bus).

The two user FPGAs are connected directly to each other and have access to a 4 MB
dual ported memory bank for inter-chip data exchange on a 4800 MB/s bus. The two FPGAs
also have two General Purpose I/O (GPIO) ports, each connected via a 2400 MB/s bus, for
moving data directly off the MAP®.

Internal to each user FPGA are an additional 144 block RAMs (BRAM) of 18 Kbits
each [19], for a total of 2,592 Kbits of BRAM. BRAM is fast since it is on the FPGA chip.


C. SOFTWARE CODE 
A user program consists of two C programs, main.c and <subroutine>.mc as well 


as “helper” files. 


1. main.c 
The main routine is a C program that runs on the SRC’s Intel processor. The 
main routine contains the declarations for the subroutine functions and makes the 


subroutine functions visible to the Intel processor. 


To effectively use the MAP® hardware, we need to partition the code and select 
the portions that will provide improved overall performance when executed on the MAP® 
processor. These include loops that can be pipelined, or manipulation of bits that are in a 
long bit stream of data [20]. They are placed in a C program described in the next 


section. 
2. <subroutine>.mc

These are the files that contain the function subroutine that is called from the main 
routine to execute on the MAP® boards. The code in the .mc files should not contain any 
external calls outside the MAP® with the exception of SRC-defined or user-defined 


macros. 


The .mc file does not allow any system calls or runtime functions that require 
intervention from the operating system. The only exception is the printf statement which 
is ignored during compile time except in debug mode; the printf statement is very handy 
in the debug mode. This means that .mc cannot contain any additional system header 
files besides the libmap.h header file, which is the only runtime library allowed in the 


MAP® [16]. 


3. Makefile 

Many files are used during compilation. The Makefile identifies the files and 
commands that are used by the compiler. The Makefile allows the programmer to set the 
source code preprocessing environment variables, C compiler flags, MAP® compiler 
flags and simulation compiler flags [16]. SRC provides a template that can be tailored 


for the specific needs of the program. 


4. Macros 

Macros allow the programmer to design in HDL. They are more flexible than the
<subroutine>.mc file alone: macros let the programmer create
specific and unique hardware that can manipulate anything from wide bit values down to
single bits.


To implement a macro, the Makefile needs to know where to find the HDL files 


and the macro support files. The following are required for macros: 
a. info

The info file provides the MAP® compiler with the name of the macro and
the relationship between the call and the macro instantiation. The info file defines the
name, the characteristics (such as whether the macro is pipelined and whether it interacts
with external systems outside the code block), the latency of the circuit specified by the
macro, and the number of inputs and outputs. The MAP® compiler requires the info file
in order to generate the signal names and macros in its Verilog code and to correctly
map the operators and calls in the source program [16].


The info file can also be used to define the behavior, in C, that the 
hardware is expected to perform. This feature is available for the debug mode and uses 
the Intel processor to emulate the hardware that the programmer intends to design on the 


MAP®. 


If multiple macros are used, the user only needs one info file. The 


information associated with the different macros must be put into the one info file. 


b. blk.v 
The black box file, blk.v, describes the macro’s interface. It is a simple file,
written in a Verilog style, that gives the number of bits for each input and output.

If multiple macros are used, the user must add the interface information
for all of them into a single blk.v file.


c. HDL Files 
The HDL files can be written in VHDL or Verilog. They are specified in 
the Makefile. 


d. Location for NGO Directory 
This location must be specified in the Makefile to identify the directory
that will contain all the NGO files. The recommended practice is to put the NGO
directory in the same directory with all the macro information, and include the info, blk.v
and HDL files.


The macros describe the logical design at a high level. The NGO files are 
used by NGDbuild to create an NGD file. The NGD file describes the logical design in 


terms of Xilinx primitives (basic elements in the FPGA). 


D. SUMMARY 
The SRC system provides flexibility, and a user-friendly interface for designing 


specialized hardware. 


Various implementations of the NFG were built on the SRC system. The results 


are documented in section V. 
V. IMPLEMENTATION RESULTS


A. UNIFORM SEGMENTATION 
Uniform segmentation is easier to implement in terms of programming the 
<subroutine>.mc file. Appendix C shows the code main.c and subr.mc for uniform and 


non-uniform segmentation. 


1. Floating Point Implementation 

Two major advantages of the uniform segmentation floating point implementation 
are (1) the multiplier does all the work of moving the decimal point and (2) once the file 
is compiled, any function can be computed without having to recompile. The only 


requirement is to change the memory file. 


The disadvantage is that floating point operations require a large amount of hardware.
The complexity of using floating point is hidden from the user, but is evident in the
number of multipliers consumed and the pipeline depth required. Figure 15 shows the
summary report after the compile process is completed (i.e. after the user types make hw).


































































































































































































########################################################
################ INNER LOOP SUMMARY ####################
loop on line 55:
clocks per iteration: 1
pipeline depth: 84
########################################################
############## PLACE AND ROUTE SUMMARY #################
Number of Slice Flip Flops: 17,647 out of 67,584 26%
Number of 4 input LUTs: 9,299 out of 67,584 13%
Number of occupied Slices: 11,390 out of 33,792 33%
Number of MULT18X18s: 64 out of 144 44%
freq = 100.2 MHz
########################################################
































































































































Figure 15. NFG pipeline depth and place and route summary.

The SRC has user callable macros that are summarized in Appendix A of [20].
Figure 16 shows the difference between the pipeline depth of the NFG and that of the
SRC user callable cosine macro. The pipeline depth of the NFG (84) is 20% less than
that of the user callable macro (105).

Figure 16 also shows the place and route information associated with mapping
both the NFG and SRC’s user callable cosine macro. Comparing Figure 15 with Figure
16, one can see that the hardware requirements have increased due to adding SRC’s user
callable macro.





















































































































































































































































########################################################
################ INNER LOOP SUMMARY ####################
loop on line 55:
clocks per iteration: 1
pipeline depth: 84
loop on line 72:
clocks per iteration: 1
pipeline depth: 105
########################################################
############## PLACE AND ROUTE SUMMARY #################
Number of Slice Flip Flops: 27,557 out of 67,584 40%
Number of 4 input LUTs: 17,318 out of 67,584 25%
Number of occupied Slices: 17,862 out of 33,792 52%
Number of Block RAMs: 1 out of 144 1%
Number of MULT18X18s: 92 out of 144 63%
freq = 100.0 MHz
########################################################











































































































Figure 16. Pipeline depth (NFG and SRC Cosine Macro). Place and route summary. 


Table 11 shows a comparison of the hardware used to build the NFG alone, the macro
alone, and both on the same FPGA. The comparison shows that the NFG approximation is
close to the macro in terms of hardware needed, with the exception of the multipliers.
The NFG requires slightly more than double the multipliers that the macro requires
(44% versus 19% of the MULT18X18s).
                        | NFG Alone | Macro Alone | NFG & Macro
# of Slice Flip Flops   | 26%       | 21%         | 40%
# of 4 input LUTs       | 13%       | 14%         | 25%
# of occupied Slices    | 33%       | 27%         | 52%
# of Block RAMs         | 0%        | 1%          | 1%
# of MULT18X18s         | 44%       | 19%         | 63%
Freq                    | 100.2 MHz | 100.1 MHz   | 100.0 MHz

Table 11. Comparison of NFG uniform segmentation and macros: NFG alone, Macro alone
and both (function is cos(πx)). Implementations without offset.


The implementation described above applies to functions that have a domain
interval that starts at zero. If the interval starts at a non-zero value, then the index
computed needs to be adjusted by an offset value. Figure 17 shows the hardware
requirements when the offset is applied.



























































































































































































































































































































































########################################################
################ INNER LOOP SUMMARY ####################
loop on line 56:
clocks per iteration: 1
pipeline depth: 98
loop on line 74:
clocks per iteration: 1
pipeline depth: 127
########################################################
############## PLACE AND ROUTE SUMMARY #################
Number of Slice Flip Flops: 29,306 out of 67,584
Number of 4 input LUTs: 20,678 out of 67,584
Number of occupied Slices: 20,125 out of 33,792
Number of Block RAMs: 1 out of 144
Number of MULT18X18s: 72 out of 144
freq = 100.0 MHz
########################################################
























































































































































Figure 17. Pipeline depth (NFG and SRC √(−ln(x)) implemented in macros). Place and
route summary with subtraction hardware included for computing offset (when
finding the index of coefficients).


The adjustment is a subtraction operation. In the floating point number system,
the hardware required to perform arithmetic computations is large, and by adding a
subtraction computation, the NFG pipeline depth increases from 84, as shown in Figure
15 and Figure 16, to 98, as shown in Figure 17.


Figure 18 shows the comparison between the output of the macro and the NFG. 
The macro computes using float values, while the NFG can compute higher precision 
values. Therefore, a user can achieve a shorter pipeline depth and higher accuracy by 
using the NFG. The cost of using the NFG is that the user must have a memory file to 


load the coefficients of the quadratic approximation into OBM. 


Figure 18 compares the results from the NFG, which uses a memory
file with the coefficients computed with an accuracy of ε = 2^-33. This implementation has
459 segments and an accuracy of 32 bits.

The first labeled column in Figure 18, x Values, shows the values of x,
which in this case are the segment endpoints. Based on the Remez algorithm, the end
points, begin points and two other points in the middle of each segment have the worst
case approximation error. Therefore, we expect the error at these points to be very
close to the maximum error allowed for the segmentation,
i.e. ε = 2^-33 = 1.1641532... × 10^-10 (essentially at the 10th decimal place).


Excel and MATLAB are used to compute $\cos(\pi x)$. The results from Excel and MATLAB are exactly the same, as shown in Figure 18 in the column labeled Excel-MATLAB (the difference of the results is zero). The NFG output and the SRC cosine macro are compared to Excel, and the results are shown in the last two columns. Figure 18 shows that SRC's macro is accurate to $\varepsilon = 2^{-23}$, which is the correct accuracy for floating point values. The NFG is accurate to within $2^{-33}$. This accuracy can be increased without an increase in FPGA hardware, if desired. The cost is OBM memory to store a larger coefficients table.
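The single precision limit quoted above can be illustrated with a short sketch. It only models the effect of storing $\cos(\pi x)$ in a 32 bit float; it does not model the internals of the SRC cosine macro. The helper names `to_float32` and `single_precision_error` are hypothetical, and the x values are taken from the first column of Figure 18.

```python
import math
import struct

def to_float32(value: float) -> float:
    """Round a binary64 Python float to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def single_precision_error(x: float) -> float:
    """Error introduced only by storing cos(pi*x) in single precision."""
    exact = math.cos(math.pi * x)
    return to_float32(exact) - exact

# A few of the segment endpoints listed in Figure 18.
xs = [0.00089400089400090, 0.00894400894400890, 0.02683302683302680]
errors = [single_precision_error(x) for x in xs]

for x, err in zip(xs, errors):
    print(f"x = {x:.17f}  float32 storage error = {err:+.3e}")

# The two accuracy levels compared in the text.
print(f"2^-23 = {2.0 ** -23:.14f}")
print(f"2^-33 = {2.0 ** -33:.14f}")
```

For results in [0.5, 1), single precision values are spaced $2^{-24}$ apart, so the storage error alone stays below $2^{-25}$, the same order of magnitude as the macro errors listed in Figure 18.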
x Values             NFG OUTPUT      SRC MACRO       Excel Cosine    MATLAB          Excel-MATLAB          ySRCMacro - Excel  ysubr - Excel
                     (ysubr)         (ySRCMacro)
0.00089400089400090  0.999996055923  0.999996066093  0.999996055923  0.999996055923  0.000000000000000000   0.00000001017031  -0.00000000000016
0.00178850178850180  0.999984214899  0.999984204769  0.999984214899  0.999984214899  0.000000000000000000  -0.00000001012988  -0.00000000000049
0.00268300268300270  0.999964477019  0.999964475632  0.999964477020  0.999964477020  0.000000000000000000  -0.00000000138820  -0.00000000000081
0.00357750357750360  0.999936842441  0.999936819077  0.999936842442  0.999936842442  0.000000000000000000  -0.00000002336516  -0.00000000000114
0.00447200447200450  0.999901311383  0.999901294708  0.999901311383  0.999901311383  0.000000000000000000  -0.00000001667436  -0.00000000000001
0.00536600536600540  0.999857910602  0.999857902527  0.999857910603  0.999857910603  0.000000000000000000  -0.00000000807656  -0.00000000000178
0.00626050626050630  0.999806591898  0.999806582928  0.999806591900  0.999806591900  0.000000000000000000  -0.00000000897243  -0.00000000000211
0.00715500715500720  0.999747377743  0.999747395515  0.999747377745  0.999747377745  0.000000000000000000   0.00000001777088  -0.00000000000195
0.00804950804950800  0.999680268602  0.999680280685  0.999680268604  0.999680268604  0.000000000000000000   0.00000001208112  -0.00000000000276
0.00894400894400890  0.999605265006  0.999605238438  0.999605265009  0.999605265009  0.000000000000000000  -0.00000002657168  -0.00000000000307
0.00983850983850980  0.999522367549  0.999522387981  0.999522367552  0.999522367552  0.000000000000000000   0.00000002042948  -0.00000000000339
0.01073301073301070  0.999431576883  0.999431550503  0.999431576887  0.999431576887  0.000000000000000000  -0.00000002638399  -0.00000000000372
0.01162751162751160  0.999332893727  0.999332904816  0.999332893731  0.999332893731  0.000000000000000000   0.00000001108489  -0.00000000000406
0.01252201252201250  0.999226318859  0.999226331711  0.999226318863  0.999226318863  0.000000000000000000   0.00000001284753  -0.00000000000436
0.01341651341651340  0.999111853121  0.999111831188  0.999111853126  0.999111853126  0.000000000000000000  -0.00000002193770  -0.00000000000469
0.01431101431101430  0.998989497418  0.998989522457  0.998989497423  0.998989497423  0.000000000000000000   0.00000002503456  -0.00000000000501
0.01520501520501520  0.998859327721  0.998859345913  0.998859327726  0.998859327726  0.000000000000000000   0.00000001818690  -0.00000000000535
0.01609951609951610  0.998721199455  0.998721182346  0.998721199461  0.998721199461  0.000000000000000000  -0.00000001711431  -0.00000000000568
0.01699401699401700  0.998575184308  0.998575210571  0.998575184314  0.998575184314  0.000000000000000000   0.00000002625699  -0.00000000000601
0.01788851788851790  0.998421283434  0.998421311378  0.998421283440  0.998421283440  0.000000000000000000   0.00000002793842  -0.00000000000633
0.01878301878301880  0.998259498047  0.998259484768  0.998259498053  0.998259498053  0.000000000000000000  -0.00000001328536  -0.00000000000665
0.01967751967751970  0.998089829425  0.998089849949  0.998089829432  0.998089829432  0.000000000000000000   0.00000002051732  -0.00000000000698
0.02057202057202060  0.997912278907  0.997912287712  0.997912278915  0.997912278915  0.000000000000000000   0.00000000879730  -0.00000000000730
0.02146652146652150  0.997726847897  0.997726857662  0.997726847905  0.997726847905  0.000000000000000000   0.00000000975711  -0.00000000000763
0.02236102236102240  0.997533537859  0.997533559799  0.997533537867  0.997533537867  0.000000000000000000   0.00000002193241  -0.00000000000795
0.02325552325552330  0.997332350318  0.997332334518  0.997332350326  0.997332350326  0.000000000000000000  -0.00000001580801  -0.00000000000828
0.02415002415002410  0.997123286864  0.997123301029  0.997123286873  0.997123286873  0.000000000000000000   0.00000001415636  -0.00000000000860
0.02504402504402500  0.996906472609  0.996906459332  0.996906472618  0.996906472618  0.000000000000000000  -0.00000001328664  -0.00000000000891
0.02593852593852590  0.996681666744  0.996681690216  0.996681666753  0.996681666753  0.000000000000000000   0.00000002346290  -0.00000000000925
0.02683302683302680  0.996448990104  0.996448993683  0.996448990113  0.996448990113  0.000000000000000000   0.00000000356950  -0.00000000000957

Float Accuracy    0.00000011920929
32 Bit Accuracy   0.00000000011642

Figure 18. Results from Uniform Segmentation NFG compared with SRC Cosine Macro, MATLAB and Excel.

The results in Figure 18 show that the accuracy in the NFG can be increased to 33 bits. To take advantage of uniform segmentation, we need to know the number of segments it requires. The quadratic coefficients for the numeric functions are stored in OBM memory. Table 12 shows the number of segments required for each of the accuracies. All the segment counts shown can be implemented in the NFG, even when the number of segments is as large as 34483, as for the numeric function $\sqrt{-\ln(x)}$.

Numeric Function                       Number of Segments (one column per value of $\varepsilon$)
$2^x$                                  8            39           311
$1/x$                                  17           81           646
$\sqrt{x}$                             [illegible]  33           257
$1/\sqrt{x}$                           11           [illegible]  439
$\log_2(x)$                            13           64           506
$\ln(x)$                               12           56           448
$\sin(\pi x)$                          14           70           559
$\cos(\pi x)$                          14           70           559
$\tan(\pi x)$                          18           88           704
$\sqrt{-\ln(x)}$                       794          4017         34483
$\tan^2(\pi x) + 1$                    30           151          1204
$-(x\log_2 x + (1-x)\log_2(1-x))$      399          2013         16667
$\frac{1}{1+e^{-x}}$                   5            23           178
$\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$     11           52           412
$\sin(e^x)$                            125          627          5103

Table 12. Number of segments required for Uniform Segmentation computed with
N=1,000,000 for various values of $\varepsilon$.
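How a uniform segmentation table is built and indexed can be sketched in a few lines. The sketch below is not the Remez-based segmentation used in this thesis: as a simpler stand-in it fits each segment with the quadratic that interpolates the function at the three Chebyshev nodes of the segment, which is close to, but not exactly, the minimax fit. All function names are hypothetical.

```python
import math

def quadratic_coefficients(f, lo, hi):
    """Coefficients (a, b, c) of the quadratic through f at the Chebyshev nodes of [lo, hi]."""
    mid, half = 0.5 * (lo + hi), 0.5 * (hi - lo)
    t0, t1, t2 = (mid + half * math.cos((2 * k + 1) * math.pi / 6) for k in range(3))
    y0, y1, y2 = f(t0), f(t1), f(t2)
    d01 = (y1 - y0) / (t1 - t0)          # first divided differences
    d12 = (y2 - y1) / (t2 - t1)
    a = (d12 - d01) / (t2 - t0)          # second divided difference
    b = d01 - a * (t0 + t1)
    c = y0 - d01 * t0 + a * t0 * t1
    return a, b, c

def build_table(f, lo, hi, segments):
    """One (a, b, c) triple per uniform segment, as a stand-in for the memory file."""
    h = (hi - lo) / segments
    return [quadratic_coefficients(f, lo + s * h, lo + (s + 1) * h) for s in range(segments)]

def evaluate(table, lo, hi, x):
    """Uniform segmentation: the segment index is a scaled, truncated offset from lo."""
    segments = len(table)
    index = min(int((x - lo) * segments / (hi - lo)), segments - 1)
    a, b, c = table[index]
    return (a * x + b) * x + c           # a*x^2 + b*x + c in Horner form

def max_error(f, lo, hi, segments, samples_per_segment=32):
    """Largest sampled error, including the segment endpoints."""
    table = build_table(f, lo, hi, segments)
    total = segments * samples_per_segment
    points = (lo + (hi - lo) * k / total for k in range(total + 1))
    return max(abs(f(x) - evaluate(table, lo, hi, x)) for x in points)

def segments_needed(f, lo, hi, eps):
    """Smallest number of uniform segments whose sampled error does not exceed eps."""
    n = 1
    while max_error(f, lo, hi, n) > eps:
        n += 1
    return n

print(segments_needed(lambda x: 2.0 ** x, 0.0, 1.0, 2.0 ** -24))
```

Under these assumptions the search for $2^x$ on [0,1] with $\varepsilon = 2^{-24}$ stops at 39 segments, the same count that Figure 19 reports for this function. Because only the table changes from one function to another, the evaluation step stays the same for every function in the suite.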
2. Fixed Point Implementation

The fixed point implementation has a shorter pipeline depth. Numeric function $2^x$ has a pipeline depth of 31 in fixed point and 84 in floating point uniform segmentation. The multiplier inferred by the SRC accepts 64 bit operands and outputs a 64 bit product that contains only the lower 64 bits of the computed 128 bit product. This presents a challenge when computing in the fixed point number system, as discussed in section III.B.4.a.
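The effect of keeping only the lower 64 bits of the product can be shown with a small model. The sketch assumes an unsigned fixed point format with 32 fraction bits; the format and the helper names are assumptions made for illustration, not a description of the SRC multiplier itself.

```python
MASK64 = (1 << 64) - 1
FRAC_BITS = 32  # assumed format: 32 fraction bits per operand

def to_fixed(value: float) -> int:
    """Convert a non-negative real value to the assumed fixed point format."""
    return int(round(value * (1 << FRAC_BITS)))

def low64_product(a: int, b: int) -> int:
    """Model of a multiplier that returns only the lower 64 bits of the 128 bit product."""
    return (a * b) & MASK64

def full_product_rescaled(a: int, b: int) -> int:
    """Reference: keep all 128 bits, then drop the 32 extra fraction bits."""
    return (a * b) >> FRAC_BITS

a, b = to_fixed(1.5), to_fixed(2.25)

# The full product carries 64 fraction bits, so the lower 64 bits hold
# only the fractional part; the integer part (3) is lost.
print(low64_product(a, b) / 2 ** 64)                    # fractional part only
print(full_product_rescaled(a, b) / (1 << FRAC_BITS))   # complete product
```

When both operands are below 1.0 the product has no integer part and the lower 64 bits are sufficient, which is why a function whose operands stay below 1.0 is the easier case.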


Table 13 shows the fixed point implementation without any special adjustments to the bits. The function is $2^x$. The green portion of the table did not require any adjustment. In the yellow section, adjustments are required to eliminate the unintended sign extension of shifted values. The last two columns show the accuracy of the NFG. The very last column shows the accuracy when rounding is performed (rounding is performed only on the final result, not at any intermediate points).


6905840 104972342 MOAI 22S 
d20c146 1094364e6 109436464 
13b12a4d 10e051a07 10e051983 
1a419353 iiZdeasit 112dca498 
ACHES Set 11l7ca6aba 117ca69e0 
27626560 IVeceere? liccecf66 
2df2ce67 121ea3d94 12lea3d06 
3483376e 1271d1ld0a Ie kolkeryi9) 
3b13a074 IDSGTido 5 12c67d962 





0 
ab 
2 
3 
4 
S) 
6 
7 
8 





a41a41a4 il ehar 8) 7) abate! 18£374959 22, 
aaaaaaab 1965fea54 1965febld 23} 
dodeSioul Sion 19da96753 19da96689 22 





Table 13. Fixed point implementation of $2^x$, no bit shifts, N=1,000,000 and $\varepsilon = 2^{-24}$.
Table 13 shows that the accuracy³ degrades in segments of higher index. This is expected because uniform segmentation results in segments that have varied accuracy. Figure 19 shows the error expected for uniform segmentation of $2^x$, which is consistent with the results in Table 13. When implemented in hardware, this design does not meet the accuracy because the values are truncated at various intermediate points in the computation. The truncation error propagates and is magnified in the result.

A bigger problem exists in indexing. In Table 13, the coefficients used to compute the NFG output for index 24 were actually the coefficients intended for segment 25. The segment indexing failed to give the correct index. These problems contributed to the lower output accuracy, as seen in the second from last column in Table 13.

The advantage of using $2^x$ is that all values are less than 1.0 except for the last value, where x is 1.0. There are no integers to deal with in this example.


[Plot: error for uniform segmentation of f(x) = 2^x; number of segments = 39; maximum error = 5.7966e-008 = 2^-24.0402.]

Figure 19. Uniform Segmentation of $2^x$, N=1,000,000 and $\varepsilon = 2^{-24}$.


3 The endpoints of the segments are used as the x input values to test the numeric functions. The
endpoints have the worst case approximation error. Table 13 shows the worst case scenario.
This implementation works for only a few functions. To make it work for the rest of the functions, a better method is required to handle integers and rounding.

Table 14 shows the implementation adjusted to accommodate the integers. As described in III.B.4.b, an arithmetic shift right (8 bits) is performed on the multiplication operands before multiplication. The product now has 16 bits to represent the integer portion of the product. This is enough for all the values that will be encountered in the suite of functions investigated.
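The operand shift can be modeled as follows. The sketch assumes signed operands with 32 fraction bits, so that after an 8 bit arithmetic shift each operand keeps 24 fraction bits and the lower 64 bits of the product hold 16 integer bits and 48 fraction bits. The format and the helper names are assumptions made for illustration.

```python
import math

MASK64 = (1 << 64) - 1
FRAC_BITS = 32   # assumed fraction bits of operands and result
PRE_SHIFT = 8    # arithmetic shift applied to each operand

def to_fixed(value: float) -> int:
    return int(round(value * (1 << FRAC_BITS)))

def to_signed64(word: int) -> int:
    """Interpret the lower 64 bits of word as a two's complement number."""
    word &= MASK64
    return word - (1 << 64) if word >> 63 else word

def multiply_preshifted(a: int, b: int) -> int:
    """Multiply two fixed point values after shifting each operand right by 8 bits."""
    a_s = a >> PRE_SHIFT                     # Python's >> is arithmetic for negative ints
    b_s = b >> PRE_SHIFT                     # 24 fraction bits remain in each operand
    low = to_signed64(a_s * b_s)             # 16 integer bits + 48 fraction bits
    return low >> (2 * (FRAC_BITS - PRE_SHIFT) - FRAC_BITS)  # realign to 32 fraction bits

def multiply_exact(a: int, b: int) -> int:
    """Reference product using all bits, truncated once at the end."""
    return (a * b) >> FRAC_BITS

a, b = to_fixed(math.pi), to_fixed(math.e)
approx, exact = multiply_preshifted(a, b), multiply_exact(a, b)
print(f"bits lost to operand truncation: about {(exact - approx).bit_length()}")
```

The integer part of the product now survives, but the 8 low bits discarded from each operand show up as an error in the product, and the error grows with the magnitude of the other operand.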


The worst case function is $\sqrt{-\ln(x)}$ because of its large coefficients. Whenever the coefficients are very large, the impact of small numbers is larger and therefore there is greater room for error. When the operands are shifted, the values are truncated, which causes error to propagate to the product. The last column shows the accuracy.


INDEX x x2 ax’2 bx fx Accuracy 
25 bits 

959 fTbb77c 3d£323a 8b2b82181 2lad8ba8  fffffffbOddal5al  ffffFELE6439c516 lech291c2 17299e284 
16 bits 
960 1£7fc3b5 3e0312a 8b09553aa 2ladedda fffffffb0de083fd  ffffffffo4364a70 lecaadc9 1728e84e6 
83 { 20 bits 

961 1£83cfee 3e1303a 8ae7350d8 2lae4ffb fffffffb0ebbelad  fffff£f£O432d005 1eca20868 172832868 
21 bits 

962 1f87dc27 3e22f6b 8acd215bd 2laeb200 fffffffb0eed1£75  f£ffFELE642£55ec lec99c5lb 17277cd06 
‘ 19 bits 

963 1f8be861 3e32ebe 8aa31a702 2lafl3fc  fffffffb0f733c23  f£ffFELE642bdbfd 1ec9182c8 1726c72cl 
{ { 22 bits 








964 1f8ff49a 3e42e30 8a8llfcf0 2laf75d3  fffffffb0£f93990  ffftff{fo4280272 ec894153 172611997 
965 1£9400d3 3e52dc4 $a5f31f3£ 2lafd7a4 fffffffbl07fl5ca ffffFFFf6424e90a lec8l00da 17255c189 oe 
966 1£980d0c 3e62d78 §a3d508d1 21b0395e fffffffb1104d205  ffftfFff64216fec lec78cl4c 1724a6a96 anee 
967 1£9c1945 3e72d4e Salb7b98d 21b09000 fffffffbll8abejd ffffftfffo4 rat ec7082a9 1723f14be ae 

; 28 bits 


968 lfa0257£ 3e82d44 89£9b3736 21b0fcad fffffffb120fe8f7  ffffffffo4layed3 lec6$4509 17233c001 


24 bits 
969 1fa431b8 3e92d5a 89d7£754d 2iblSelb  ffff££fb1295453d  f££EfEL£64170607 lec60083c 172286cbe 





25 bits 























970 lfa83df1 3ea2d92 89bo47fob 2lblb£92 ffffiffb131a8020  fffffftt64138dd2 lecdTcc71 1721d19d6 





Table 14. Fixed point, uniform segmentation of $\sqrt{-\ln(x)}$, multiplier operands shifted by 8 bits, N=1,000,000 and $\varepsilon = 2^{-24}$.
In Table 14, the first column is the index into the array. The rows show two computations: the NFG result is in the colored row, and the row below shows the correct values, which have been computed in MATLAB and converted to the number representation. Coefficients a and b, the input x and x2 are shifted 8 bits in the NFG (colored rows). The intermediate products show the error in the intermediate steps. The two products, ax2 and bx, have been realigned before the final addition step. Table 14 shows the effect of the error as it propagates from the intermediate steps to the final answer. The last column shows the number of bits that match between the NFG output (in the colored row) and the desired output. This indicates how accurately the NFG has performed. As can be seen, there are instances where the error is large.
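One simple way to count matching bits between two fixed point words is to locate the most significant bit in which they differ. This is only a sketch of such a metric; the word width and the exact counting rule used for Table 14 are not stated here, so both are assumptions, and the function name is hypothetical.

```python
def matching_leading_bits(actual: int, desired: int, width: int = 64) -> int:
    """Number of most significant bits (out of width) on which the two words agree."""
    mask = (1 << width) - 1
    diff = (actual ^ desired) & mask     # 1 wherever the words disagree
    return width - diff.bit_length()     # bits above the first disagreement

# Two 8 bit words that first disagree in bit 2 share their top 5 bits.
print(matching_leading_bits(0b10110000, 0b10110100, width=8))
```

A count based on leading bits is pessimistic when a small numeric difference causes a carry to ripple through many bits, so an alternative is to take the base 2 logarithm of the absolute difference.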


Table 15 shows that the pipeline depth is 32. It also shows the summary of place and route and the hardware resource requirements to implement uniform segmentation using fixed point numbers. This data is the same for all the numeric functions. The memory file determines which numeric function will be implemented.

################################################################
#################### INNER LOOP SUMMARY ########################
loop on line 54:
    clocks per iteration: 1
    pipeline depth: 32
################################################################
################## PLACE AND ROUTE SUMMARY #####################
Number of Slice Flip Flops: 8,751 out of 67,584 12%
Number of 4 input LUTs: 3,282 out of 67,584 4%
Number of occupied Slices: 5,226 out of 33,792 15%
Number of MULT18X18s: 40 out of 144 27%
freq = 100.0 MHz
################################################################

Table 15. Pipeline depth and hardware resources for uniform implementation with no 
adjustments. 


Table 16 is a comparison of uniform segmentation between the floating point and 
fixed point NFG implementations. They both require the same size memory files, but the 
floating point hardware can handle a larger range of values than the fixed point 


implementation. 
                         Floating Point   Fixed Point   Fixed Point / Floating Point
Pipeline Depth           84               32            38 %
# of Slice Flip Flops    26%              12%           46 %
# of 4 input LUTs        13%              4%            31 %
# of occupied Slices     33%              15%           45 %
# of Block RAMs          0%               0%            0 %
# of MULT18X18s          44%              27%           61 %
Freq                     100.2 MHz        100.1 MHz     0 %

Table 16. Comparison of uniform segmentation NFG between fixed point and floating point.


B. NON-UNIFORM SEGMENTATION 
Non-uniform segmentation requires a segment index encoder. The SRC 
programming environment has a priority selector macro that is used as the segment index 


encoder for the NFG. 


1. Floating Point Implementation 

The priority selector macro in the SRC is used as the segment index encoder. The priority selector has a limit (approximately 150 elements) when used in the NFG with three 64 bit multipliers. The non-uniform segmentation NFG, in floating point, has a pipeline depth of 74.
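In software, the job of the segment index encoder can be sketched as a search over the segment begin points. The sketch below is a behavioral analogy only: the boundaries are made-up values, the function names are hypothetical, and the interface of the SRC priority selector macro is not represented.

```python
import bisect

def segment_index(begin_points, x):
    """Index of the non-uniform segment containing x (begin_points must be sorted)."""
    return bisect.bisect_right(begin_points, x) - 1

def segment_index_parallel(begin_points, x):
    """Priority style model: compare x with every begin point, keep the highest match."""
    matches = [x >= b for b in begin_points]      # in hardware these comparisons run in parallel
    return max(i for i, hit in enumerate(matches) if hit)

# Hypothetical begin points of four non-uniform segments on [0, 1).
begin_points = [0.0, 0.1, 0.25, 0.5]

for x in (0.0, 0.3, 0.5, 0.99):
    print(x, segment_index(begin_points, x), segment_index_parallel(begin_points, x))
```

The parallel model needs one comparator per segment, which is consistent with the observation that the selector size, and therefore the number of segments, is bounded by the available FPGA resources.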


The math macros available in the SRC have pipeline depths that vary. For example, $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ implemented using the math macros has a pipeline depth of 274, as shown in Table 17. Table 17 summarizes the hardware pipeline depth for the suite of numeric functions. The table shows side by side comparisons of the pipeline depth for the NFG and the SRC math macros. In 10 of the 15 functions, the NFG pipeline depth is smaller. For one function the pipeline depths are the same, and for 4 of the functions the NFG pipeline depth is larger. Regardless of the size of the function, the NFG has the same pipeline depth; the only exception is $\sin(e^x)$, which is only one clock longer.
Three functions in Table 17 are limited by the number of segments required. In the floating point implementation, with 3 multipliers and the other hardware requirements, the FPGA runs out of resources to build large priority selectors. The priority selectors were limited to approximately 150 segments. Implementations requiring larger selectors did not compile on the MAP. The data was obtained by compiling in debug mode. Some of the implementations were built in hardware, for example:

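Function                               Pipeline depth   Pipeline depth   Number of
                                       (SRC macros)     (NFG)            segments
$2^x$                                  132              74               35
$1/x$                                  70               74               50
$\sqrt{x}$                             43               74               24
$1/\sqrt{x}$                           74               74               36
$\log_2(x)$                            [illegible]      74               44
$\ln(x)$                               61               74               39
$\sin(\pi x)$                          105              74               58
$\cos(\pi x)$                          105              74               58
$\tan(\pi x)$                          135              74               58
$\sqrt{-\ln(x)}$                       127              74               163⁴
$\tan^2(\pi x) + 1$                    254              74               79
$-(x\log_2 x + (1-x)\log_2(1-x))$      114              74               183⁴
$\frac{1}{1+e^{-x}}$                   185              74               20
$\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$     274              74               45
$\sin(e^x)$                            212              75               265⁴

Table 17. Pipeline depth for various implementations using the available macros or the NFG in the floating point number system.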


4 Note that these numbers (number of segments) are larger than 150, and cannot be realized in priority 
selector in the floating point implementation. 
When both the NFG and the macro are built on the FPGA, a large amount of resources is consumed and the frequency may be affected due to place and route difficulties and increased delay in the wiring. Figure 20 shows the place and route summary when numeric function $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ is implemented with the macros and the NFG, both on the same FPGA. The frequency is 77.2 MHz.

################################################################
#################### INNER LOOP SUMMARY ########################
loop on line 53:
    clocks per iteration: 1
    pipeline depth: 74
loop on line 139:
    clocks per iteration: 1
    pipeline depth: 274
################################################################
################## PLACE AND ROUTE SUMMARY #####################
Number of Slice Flip Flops: 51,967 out of 67,584 76%
Number of 4 input LUTs: 39,520 out of 67,584 58%
Number of occupied Slices: 33,790 out of 33,792 99%
Number of Block RAMs: 3 out of 144 2%
Number of MULT18X18s: 90 out of 144 62%
freq = 77.2 MHz
################################################################

Figure 20. NFG and macro both built on the FPGA for numeric function $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.
The performance improves if only one is built at a time. Figure 21 shows the same function built on the FPGA using the NFG only. The frequency is 100.0 MHz.

################################################################
#################### INNER LOOP SUMMARY ########################
loop on line 53:
    clocks per iteration: 1
    pipeline depth: 74
################################################################
################## PLACE AND ROUTE SUMMARY #####################
Number of Slice Flip Flops: 26,377 out of 67,584 39%
Number of 4 input LUTs: 16,386 out of 67,584 24%
Number of occupied Slices: 17,473 out of 33,792 51%
Number of MULT18X18s: 48 out of 144 33%
freq = 100.0 MHz
################################################################

Figure 21. NFG built on the FPGA for numeric function $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.


Table 18 shows the results from computing $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, with N=1,000,000 and $\varepsilon = 2^{-24}$. The values are displayed to twelve decimal places. This function requires 45 segments. The values of x that are tested in Table 18 are the endpoints of the segments and therefore have the worst case⁵ approximation error. At the very bottom of Table 18 is $\varepsilon = 2^{-24}$ in decimal. The last column shows that the approximation error is consistently smaller than $\varepsilon$, per the design.


5 If the x input to the NFG were somewhere in the middle of the segment, the approximation error 
would be smaller. There are four points in a segment with worst case approximation error. Figure 10 is a 
good example to see the distribution of the approximation error on a non-uniform segment. 
194 clocks   396 clocks
x               NFG             SRC OUTPUT      Excel           SRC-Excel        NFG-Excel
0.065896761049  0.398076980336  0.398077040911  0.398077039931   0.000000000979  -0.000000059595
0.113411555833  0.396384819146  0.396384894848  0.396384878748   0.000000016100  -0.000000059601
0.155068672183  0.394174398848  0.394174486399  0.394174458446   0.000000027952  -0.000000059598
0.193392483833  0.391551192645  0.391551256180  0.391551252249   0.000000003931  -0.000000059604
0.229466279456  0.388576167409  0.388576239347  0.388576227007   0.000000012340  -0.000000059598
0.263888271986  0.385290687187  0.385290741920  0.385290746785  -0.000000004864  -0.000000059597
0.297033228393  0.381725674206  0.381725728512  0.381725733800  -0.000000005288  -0.000000059593
0.329159950016  0.377905230684  0.377905279398  0.377905290287  -0.000000010889  -0.000000059603
0.360453699018  0.373849440845  0.373849511147  0.373849500448   0.000000010698  -0.000000059604
0.391055896896  0.369575196048  0.369575262070  0.369575255651   0.000000006419  -0.000000059603
0.421076852419  0.365097238417  0.365097314119  0.365097298012   0.000000016107  -0.000000059595
0.450608489560  0.360428130051  0.360428184271  0.360428189648  -0.000000005377  -0.000000059598
0.479725761713  0.355579258917  0.355579316616  0.355579318519  -0.000000001902  -0.000000059601
1.010456600772  0.239440565640  0.239440649748  0.239440625229   0.000000024519  -0.000000059589
1.039409823988  0.232439528403  0.232439562678  0.232439587993  -0.000000025315  -0.000000059590
1.068692559293  0.225374753587  0.225374817848  0.225374813189   0.000000004659  -0.000000059602
1.098347233137  0.218248244336  0.218248322606  0.218248303940   0.000000018667  -0.000000059604
1.128421928829  0.211061263284  0.211061343551  0.211061322887   0.000000020664  -0.000000059603
1.158970386539  0.203814501730  0.203814581037  0.203814561328   0.000000019709  -0.000000059597
1.190056245939  0.196507285750  0.196507364511  0.196507345350   0.000000019161  -0.000000059600
1.221750217779  0.189138515593  0.189138561487  0.189138575189  -0.000000013702  -0.000000059596
1.254138569173  0.181705027166  0.181705087423  0.181705086768   0.000000000655  -0.000000059602
1.287320295169  0.174202711977  0.174202784896  0.174202771576   0.000000013320  -0.000000059599
1.321418432469  0.166624545576  0.166624620557  0.166624605173   0.000000015384  -0.000000059598
1.356585716292  0.158960217743  0.158960267901  0.158960277339  -0.000000009438  -0.000000059596
1.393018722519  0.151194300960  0.151194363832  0.151194360555   0.000000003277  -0.000000059595
1.414213562373  0.146762652495  0.146762669086  0.146762663174   0.000000005913  -0.000000010679

2^-24 Accuracy  0.000000059605

Table 18. Comparison between SRC macro and NFG; numeric function $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, N=1,000,000 and $\varepsilon = 2^{-24}$.



2. Fixed Point Implementation

As mentioned before, the advantage of using fixed point is the reduction in hardware and the reduced pipeline depth. The disadvantage is that it takes more work to program.

Macros may be used to define certain behavior that is easier to describe in HDL or to provide special functionality that is not available in regular programming. In the NFG, the multiplier is limited by the 64 bit architecture. The product of two 64 bit numbers does not give the user access to all 128 bits in the product. HDL can be used to manipulate and access the desired bits.

a. No Macro Multiplier (non-uniform)

The fixed point implementation without a macro is exactly the same as the uniform fixed point implementation, with one exception: the indexing in non-uniform segmentation is accomplished using the user callable priority selector macro available in the SRC.

################################################################
#################### INNER LOOP SUMMARY ########################
loop on line 46:
    clocks per iteration: 1
    pipeline depth: 28
################################################################
################## PLACE AND ROUTE SUMMARY #####################
Number of Slice Flip Flops: 8,283 out of 67,584 12%
Number of 4 input LUTs: 12,331 out of 67,584 18%
Number of occupied Slices: 11,256 out of 33,792 33%
Number of MULT18X18s: 30 out of 144 20%
freq = 100.2 MHz
################################################################

Table 19. Pipeline depth, place and route summary for $\sqrt{-\ln(x)}$, N=1,000,000 and $\varepsilon = 2^{-24}$. Non-uniform segmentation using priority selector macro.


b. Macro Multiplier Implementation

The goal is to build a multiplier in VHDL or Verilog that can successfully multiply in two’s complement and provide a result that is already shifted into the number system chosen for fixed point. Specifically, we want a product that has 32 integer bits and 32 fraction bits.

Several multipliers were built. The multipliers function correctly in simulation on PCs using Xilinx ISE, Project Navigator and the ModelSim simulation software. However, when the VHDL or Verilog files were compiled on the SRC, the products were not correct.

Appendix B shows the VHDL code for a 32x32 bit multiplier with a 32 bit product. The design instantiates the 18x18 signed multiplier primitive. The design makes use of a modified I/O pipeline design from a Xilinx application note [22].

Appendix B also shows the Verilog file for a 64x64 bit multiplier with a 64 bit product. The 64x64 bit multiplier makes use of the source code for the 64x64 bit multiplier macro designed by SRC.


C. SOURCES OF ERROR

The floating point implementation has only the errors associated with the MATLAB computed values and the restrictions placed on $\varepsilon$. When implemented in the SRC, double precision accurately represents what is expected from the values fed into the NFG and the coefficients table.

The fixed point implementation had errors for several reasons. We explore some of those reasons for error in the NFG as a whole.

1. Function Approximation

Both floating point and fixed point have to work with approximation error. This is discussed in detail in section II B (Segmentation).

2. Absence of Rounding in the Multiplier

The fixed point implementation of the NFG shifts binary bits and truncates lower and upper bits. This introduces error in computing the products, and these errors propagate to the final answer.
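The difference between truncating and rounding when low bits are discarded can be seen in a short sketch. It uses plain integers and hypothetical function names; it illustrates the arithmetic only and is not the rounding macro discussed in this thesis.

```python
def shift_truncate(value: int, shift: int) -> int:
    """Drop the low bits; the result is always rounded toward minus infinity."""
    return value >> shift

def shift_round(value: int, shift: int) -> int:
    """Add half of the discarded weight first, so the result is rounded to nearest."""
    return (value + (1 << (shift - 1))) >> shift

value, shift = 0b10111100, 4        # 188 / 16 = 11.75
print(shift_truncate(value, shift)) # truncation keeps 11
print(shift_round(value, shift))    # rounding gives 12
```

Truncation can be wrong by almost one unit in the last retained bit and always errs in the same direction, whereas rounding is wrong by at most half a unit. Because every truncated intermediate result is biased the same way, the errors tend to accumulate rather than cancel.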


3. Insufficient Bits

Insufficient bits to represent the full product means that the numbers have to be shifted and truncated. This limits the capability of the NFG.
D. SUMMARY

The NFG implementation of uniform segmentation using the floating point number system has a pipeline depth of 84 or 98, depending on whether the begin point of the domain interval is zero or non-zero (zero is preferred). This implementation must read a memory file containing the polynomial coefficients into OBM. Aside from these requirements, the NFG implemented with uniform segmentation and the floating point number system provides advantages over using the available user callable macros and the math operators. It can be implemented with very high precision, a shorter pipeline depth and, in some cases, less hardware.

Another advantage of uniform segmentation is that, once compiled, the NFG can compute any of the 15 functions. The memory file with the coefficients must be available.

The NFG non-uniform implementation has a shorter pipeline depth, but requires much hardware to implement the segment index encoder. The segment index encoder is limited to approximately 150 segments in this design. Depending on the function, the precision can be increased as long as the number of segments does not exceed approximately 150.

The fixed point implementation requires a rounding macro and a good macro multiplier to provide the desired product bits and make it effective. However, it provides a significantly smaller pipeline depth than the floating point implementation.

A real advantage of the NFG appears when very complicated numeric functions need to be implemented: the NFG has a constant pipeline depth, unlike macro implementations of the more complicated functions, which have long pipeline depths.

More research is required to realize a complete NFG design. Section VI discusses some suggestions for future work.



VI. CONCLUSION 


A. SUMMARY OF WORK 
An efficient and fast segmentation of numeric functions was accomplished in MATLAB. Table 20 shows the number of tests (calls to chebyRemz) required to segment the suite of 15 functions.

Epsilon = 0.0000000596 = 2^-24.0. N = 1000000
Function            Interval           % of tests   # of Segments
2^x                 [0,1]              0.00910      35
1/x                 [1,2]              0.01020      50
sqrt(x)             [1,2]              0.00750      24
1/sqrt(x)           [1,2]              0.00720      36
log2(x)             [1,2]              0.00900      44
log(x)              [1,2]              0.00780      39
sin(pi*x)           [0,1/2]            0.01990      58
cos(pi*x)           [0,1/2]            0.01740      58
tan(pi*x)           [0,1/4]            0.01240      58
sqrt(-log(x...      [1/512,1/4]        0.04070      163
tan(pi*x)^2+1       [0,1/4]            0.02180      79
-(x*log2(x)...      [1/256,1-1/256]    0.04710      183
1/(1+exp(-x...      [0,1]              0.00920      20
(1/sqrt(2*pi...     [0,sqrt(2)]        0.01670      45
sin(exp(x))         [0,2]              0.07810      265
**********************************************************

Table 20. Speed-up in computation time for 15 functions (expressed as a percentage of the time needed when the domain is divided into 1,000,000 points) for $\varepsilon = 2^{-24}$.


The NFG circuit built in the SRC was very effective in floating point. The 
computation of numeric functions in the NFG was shown to obtain accuracy of up to 33 
bits. Higher accuracy is possible at the cost of increasing the size of the memory files 


required to store the coefficients. 


Floating point implementation was easier to build on the SRC than the fixed point 
implementation. However, floating point implementation takes up a large amount of 


FPGA resources. 
The NFG is a useful technique for computing complicated numeric functions that would otherwise require a combination of several other arithmetic operations. The more demanding the numeric function, the stronger the case for using the NFG. With non-uniform segmentation, the NFG is more efficient for 10 of the 15 functions investigated in this thesis.


The fixed point implementation did not produce all of the desired results. Its multiplication required more programming than the floating point implementation, and the results contained errors caused by rounding and truncating the intermediate and final results. This area needs further research. The advantage of the fixed point implementation is that it requires much less hardware than floating point and can therefore reduce the pipeline depth to about 30% of the depth required by the floating point implementation.


B. SUGGESTED FUTURE WORK 

1. Hybrid of Uniform and Non-Uniform Segmentation

Uniform segmentation is much faster and less complicated than non-uniform 
segmentation. Although non-uniform segmentation may not be practical on its own, a 
hybrid of non-uniform and uniform segmentation would take advantage of the strengths 


of each. 
Consider a numeric function that is not suitable for uniform segmentation, such as $\sqrt{-\ln(x)}$, which appears in Figure 4 to demonstrate this fact. In the non-uniform segmentation of the same function, shown in Figure 2, the restricting portion is the beginning of the interval. Therefore, to capture the most restricting part of the numeric function, first divide the function into a few non-uniform segments.

A good starting point is to set an upper limit on the total number of uniform (constant-width) segments. Let us decide on 400 segments. If we dedicate 100 uniform segments to the first portion of the numeric function $\sqrt{-\ln(x)}$, then change the segment size for another 100 uniform segments, and repeat this process four or five times, we will have five non-uniform segments, each containing a set of uniform segments.

This method would provide three advantages:


1. The segmentation constraint imposed by the most restricting segment would be relieved.

2. The segment index encoder would be small (5 groups of segments) and save FPGA space.

3. The indexing would be less complex once the input has been mapped to the correct group of segments.
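
The two-level indexing that this hybrid scheme implies can be sketched as follows. This is a minimal Python illustration, not part of the thesis code: the function name `hybrid_index`, the group boundaries, and the choice of 100 sub-segments per group are all assumptions made for the example.

```python
from bisect import bisect_right

def hybrid_index(x, group_bounds, per_group):
    """Map x to (group, sub-segment, flat index) for a hybrid segmentation.

    group_bounds: ascending boundaries [b0, ..., bG] of G non-uniform groups.
    per_group:    number of equal-width sub-segments inside every group.
    """
    if not group_bounds[0] <= x <= group_bounds[-1]:
        raise ValueError("x outside the domain")
    # Small encoder: only the group has to be found by comparison.
    g = min(bisect_right(group_bounds, x) - 1, len(group_bounds) - 2)
    lo, hi = group_bounds[g], group_bounds[g + 1]
    # Uniform indexing inside the group: scale and truncate.
    s = min(int((x - lo) / (hi - lo) * per_group), per_group - 1)
    return g, s, g * per_group + s

# Hypothetical boundaries on [1/512, 1/4]: narrow groups near the steep
# left end of sqrt(-ln(x)), wider groups to the right.
bounds = [1/512, 1/256, 1/64, 1/16, 1/8, 1/4]
print(hybrid_index(1/512, bounds, 100))
print(hybrid_index(0.2, bounds, 100))
```

Only the group lookup needs comparators; the sub-segment index is a subtraction, a scaling and a truncation, which is what would keep the encoder small.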


2. Expand the Domain of the NFG via Mapping 
The functions investigated in this thesis have a limited domain interval. To make 
the functions useful for a wide range of applications, the domain interval should be 


increased. Theoretical research is being conducted in this field [21]. 


3. Build an HDL Multiplier Macro and Tap Off Desired Bits
If the fixed point multiplier were built as a macro, the desired product bits could be tapped off directly. This implementation would be both fast and accurate.


4. Build a Rounding Macro
A macro can be built to round shifted values in the fixed point implementation instead of truncating them. This would improve the accuracy of the products and of the final result of the NFG.
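
The difference between truncating and rounding a fixed point product can be illustrated with a short Python sketch. The 8-bit fraction width and the helper names are assumptions chosen for the example; they do not describe the SRC macros.

```python
def to_fixed(value, frac_bits):
    """Quantize a real value to an integer with frac_bits fraction bits."""
    return int(round(value * (1 << frac_bits)))

def mul_truncate(a, b, frac_bits):
    """Fixed point product that simply drops the low frac_bits bits."""
    return (a * b) >> frac_bits

def mul_round(a, b, frac_bits):
    """Fixed point product rounded to nearest: add half an LSB, then shift."""
    return (a * b + (1 << (frac_bits - 1))) >> frac_bits

FRAC = 8                      # hypothetical format: 8 fraction bits
a = to_fixed(0.7, FRAC)       # 179
b = to_fixed(0.9, FRAC)       # 230
exact = (a * b) / (1 << (2 * FRAC))
for name, mul in (("truncate", mul_truncate), ("round", mul_round)):
    p = mul(a, b, FRAC)
    print(name, p, abs(p / (1 << FRAC) - exact))
```

Rounding costs one extra addition per shift, but it bounds the error of each product by half a least significant bit instead of a full one.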


5. Efficient Segment Index Encoder Instead of Priority Selector Macros

The priority selectors are fast and work well, but take up a lot of hardware. 
Combined with the other hardware in the NFG, the priority selectors take up all the 
resources and limit the accuracy and flexibility of the NFG to handle all the functions. 
An implementation that uses a more efficient method for the segment index encoder 


would benefit the NFG. 


Sasao and Butler offer three suggestions: (1) LUT cascade, (2) content addressable memory, and (3) EVBDD.
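
To make the job of the segment index encoder concrete, the following Python sketch compares a priority-selector style scan with a binary search that returns the same index. It implements none of the three suggestions above; the endpoint values and function names are illustrative assumptions.

```python
from bisect import bisect_left

def priority_select(x, right_endpoints):
    """Scan the endpoints in order and return the first segment containing x.
    Mirrors a priority selector: one comparison per segment."""
    for index, endpoint in enumerate(right_endpoints):
        if x <= endpoint:
            return index
    raise ValueError("x lies beyond the last segment")

def binary_select(x, right_endpoints):
    """Same mapping, using about log2(n) comparisons."""
    index = bisect_left(right_endpoints, x)
    if index == len(right_endpoints):
        raise ValueError("x lies beyond the last segment")
    return index

endpoints = [0.05, 0.1, 0.2, 0.4, 0.7, 1.0]   # hypothetical right endpoints
for x in (0.0, 0.05, 0.3, 1.0):
    print(x, priority_select(x, endpoints), binary_select(x, endpoints))
```

Both functions compute the same mapping; the point is only that the mapping itself does not require one comparator per segment.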
6. Different Architecture

If FPGA resources became scarcer and one wanted to implement a larger 
coefficients table, the only way to make room is to remove the major consumers of real 
estate. In the NFG, it would be the segment index encoder that is implemented as a 
priority selector macro and the multipliers. We have already discussed possible solutions 


to removing large selectors. 


Using Horner’s rule, a multiplier can be eliminated from the NFG. Equation (0.5) 


shows how to apply Horner’s rule to the NFG. 


$f(x) = C_2 X^2 + C_1 X + C_0 = (C_2 X + C_1) X + C_0$    (0.5)


The NFG hardware would add one more adder stage. However, if the segment index encoder were able to work in one or two clocks, this would be a speed-up over the previous architecture as long as the adder stages take fewer clocks than the multipliers. Floating point adders can take as many clocks as the multipliers, but in two’s complement or signed magnitude, the adders are faster than the multiplier.


In the previous architecture, computing $X^2$ takes many more clocks than the segment index encoder and adds to the pipeline depth.
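
The operation counts behind Horner’s rule can be checked with a small Python sketch. The function names and coefficient values are assumptions for illustration only.

```python
def quad_direct(c2, c1, c0, x):
    """c2*x^2 + c1*x + c0 as written: three multiplications, two additions."""
    x_squared = x * x
    return c2 * x_squared + c1 * x + c0

def quad_horner(c2, c1, c0, x):
    """(c2*x + c1)*x + c0: two multiplications and two additions, in series."""
    return (c2 * x + c1) * x + c0

# Integer inputs keep the comparison exact.
print(quad_direct(3, -2, 5, 4), quad_horner(3, -2, 5, 4))
```

The Horner form removes the squaring multiplier, but its operations depend on one another in sequence, which is why the relative latency of adders and multipliers decides whether it is faster.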


Figure 22 shows an overview of the NFG architecture when Horner’s rule has 


been applied. 
Figure 22. Horner’s rule NFG architecture overview (blocks: input X, Encoder, Coefficients Table; output $f(x) = C_2 X^2 + C_1 X + C_0$).
APPENDIX A. MATLAB ALGORITHMS


The following MATLAB code can generate the segmentation for any function; however, a user interface has been added for convenience. The user simply picks a number instead of retyping the entire function or the interval for evaluation. The interface limits the MATLAB code to the suite of functions found in Table 1.


A.1 QUADRATIC APPROXIMATION USING POLYFIT

This code implements the quadratic approximation using the MATLAB function polyfit. Six files are needed to run the non-uniform and uniform segmentation: QuadAppxPfit.m, multipleQuadApprox.m, varQuadApprox.m, dec2binfp.m, constantQuadApprox.m, and constQuadAppxWErr.m.


QuadAppxPfit.m is the top function where the program starts and ends. All the 
other files are child functions that provide the segmentation data back to this file for 


presentation / file storage. 


multipleQuadApprox.m calls the non-uniform segmentation algorithms to collect 


the data for the segment endpoints and coefficients. 


varQuadApprox.m tests proposed segments and finds the optimum width of each segment by comparing the approximation error with ε.
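
The grow-until-the-error-bound-fails search described above can be sketched in Python. This is a simplified stand-in, not the thesis routine: the fit is a quadratic through the first, middle and last samples rather than the least-squares fit of polyfit, and all names and parameter values are assumptions.

```python
import math

def fit_quadratic(xs, ys):
    """Quadratic through the first, middle and last samples (Lagrange form).
    Stand-in for a least-squares fit; returns a callable polynomial."""
    i1, i2 = len(xs) // 2, len(xs) - 1
    x0, x1, x2 = xs[0], xs[i1], xs[i2]
    y0, y1, y2 = ys[0], ys[i1], ys[i2]
    def p(x):
        return (y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
                + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
                + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1)))
    return p

def max_error(xs, ys, start, end):
    """Largest |approximation - function| over samples start..end inclusive."""
    sx, sy = xs[start:end + 1], ys[start:end + 1]
    p = fit_quadratic(sx, sy)
    return max(abs(p(x) - y) for x, y in zip(sx, sy))

def segment(xs, ys, eps):
    """Grow each segment one sample at a time; stop just before eps is exceeded."""
    ends, start, last = [], 0, len(xs) - 1
    while start < last:
        end = min(start + 2, last)          # three samples define a quadratic
        while end < last and max_error(xs, ys, start, end + 1) <= eps:
            end += 1
        ends.append(end)
        start = end                         # neighbours share an endpoint here
    return ends

N = 300
lo, hi = 1 / 512, 1 / 4
xs = [lo + (hi - lo) * k / (N - 1) for k in range(N)]
ys = [math.sqrt(-math.log(x)) for x in xs]
ends = segment(xs, ys, 1e-4)
print(len(ends), "segments; first end indices:", ends[:5])
```

Testing one sample at a time is the slowest possible search; it is shown only because it makes the stopping rule explicit.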


dec2binfp.m converts decimal numbers into binary. It is limited to converting one value at a time, with at most 9 binary fraction bits of accuracy.
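
A conversion of this kind can be sketched in Python as follows. The sketch assumes the 18-bit (9.9) two’s complement format mentioned in the dec2binfp.m header and returns a bit string; the function name and its defaults are illustrative and do not reproduce the MATLAB routine.

```python
import math

def to_binary_fixed(value, int_bits=9, frac_bits=9):
    """Return value as a two's complement fixed point bit string 'III.FFF'."""
    total = int_bits + frac_bits
    scaled = math.floor(value * (1 << frac_bits))   # drop bits below the LSB
    lowest, highest = -(1 << (total - 1)), (1 << (total - 1)) - 1
    if not lowest <= scaled <= highest:
        raise OverflowError("value does not fit in the chosen format")
    # Masking a negative integer yields its two's complement bit pattern.
    bits = format(scaled & ((1 << total) - 1), "0{}b".format(total))
    return bits[:int_bits] + "." + bits[int_bits:]

print(to_binary_fixed(0.75))
print(to_binary_fixed(-0.5))
```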


constantQuadApprox.m is used for uniform segmentation when the number of segments is known beforehand. The key input is the desired number of segments; the approximation error is left unspecified.


constQuadAppxWErr.m requires ε to be specified; it then computes the uniform segmentation of the numeric function that meets the constraint ε.
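
Both uniform routines shift the constant coefficient so that the largest positive and largest negative errors balance. The arithmetic of that shift is sketched below in Python; the function name and the error values are made up for the example.

```python
def balance_shift(errors):
    """Constant to subtract from the approximation so that the largest
    positive and the largest negative error end up equal in magnitude."""
    return (max(errors) + min(errors)) / 2.0

# Hypothetical errors (approximation minus function) inside one segment.
errors = [0.75, 0.25, -0.25]
shift = balance_shift(errors)
balanced = [e - shift for e in errors]
print(shift, balanced, max(abs(e) for e in balanced))
```

In this example the worst-case error drops from 0.75 to 0.5 without changing the quadratic or linear coefficients.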


FILE: QuadAppxPfit.m

% Arbitrary_PW_Quadratic_Approx.m
% Created: January 6, 2006 (from Arbitrary_PW_Linear_Approx.m)
% Last modified: October 20, 2006
% Produced by: Tom Mack & Jon Butler
% Modified by: Njuguna Macaria for quadratic approximation
%
% This program produces a segmentation of a given function using either:
%    1. Uniform piecewise Quadratic approximation
%    2. Non-uniform piecewise Quadratic approximation
%    3. Both
%
% It is based on the algorithm:
%    1. For non-uniform, the MATLAB polyfit function
%    2. For uniform, dividing the range of the input into
%       equal, user-defined segments
%       or by using max error to determine max segment length
%       at the greatest curvature and then dividing the range
%       up into equal segments.
%       All with intercept shifting to balance the positive
%       and negative error
%
% Inputs
%    N       - number of elements on which function is expressed
%    f(x)    - function to be evaluated
%    x_low   - low end of interval over which f(x) is evaluated
%    x_high  - high end of interval over which f(x) is evaluated
%    epsilon - precision of approximation (for variable only)
%    consegs - number of segments to use to approximate (constant only)
% Outputs
%    Segment info - Segment #, Begin Pt, End Pt, Coefficients, & Error
%    Plot showing the approximation
%    Text file used to initialize memory in SRC (both Binary & Decimal)
%%%%%%%%%%%%%%%%%% INPUT OF USER-SPECIFIED PARAMETERS %%%%%%%%%%%%%%%%%%
clear
close all
format long g
fprintf('\n')
fprintf('\n ************************************************************')
fprintf('\n')
fprintf('\n QUADRATIC APPROXIMATION OF A FUNCTION USING POLYFIT ')
fprintf('\n')
fprintf('\n')
%% Get FUNCTION to be approximated (user input)
func = input('Input the Function, func[sqrt(-1*log(x))]: ','s');
if isempty(func)
    func = 'sqrt(-1*log(x))';   %% default
end
%% Get LOW range (user input)
x_low = input('Input the Lower Range of x - LOW value, x(low) [1/256]: ');
if isempty(x_low)
    x_low = 1/256;              %% default
end
%% Get HIGH range (user input)
x_high = input('Input the Higher Range of x - HIGH value, x(high) [1/4]:');
if isempty(x_high)
    x_high = 1/4;               %% default
end
%% Get CONSTANT or VARIABLE segmentation (user input)
vari_or_const = 0;
while vari_or_const ~= 1 && vari_or_const ~= 2 && vari_or_const ~= 3
    vari_or_const = ...
        input('(1)Non-uniform (2)Uniform Segmentation or (3)Both [1]: ');
    if isempty(vari_or_const)
        vari_or_const = 1;      %% default
    end
end
%% If non-uniform segmentation, then enter Non-uniform ERROR parameters
if vari_or_const ~= 2
    epsilon = input('Input the Desired Error, epsilon[0.0001]: ');
    if isempty(epsilon)
        epsilon = 0.0001;       %% default
    end
end
%% If uniform segmentation, find how the user will restrict # of segments
%% (right-hand sides of the next two tests are missing in the source scan;
%%  the values used here are assumed from context)
if vari_or_const ~= 1
    err_or_segs = ...
        input('Constrain by (1)Number of Segments or (2)Error [1]: ');
    if isempty(err_or_segs)
        err_or_segs = 1;        %% default
    end
    if err_or_segs == 1
        consegs = input('Input the number of Desired Segments[100]: ');
        if isempty(consegs)
            consegs = 200;      %% default
        end
    end
    if err_or_segs == 2
        epsilon = input('Input the Desired Error, epsilon[0.0001]: ');
        if isempty(epsilon)
            epsilon = 0.0001;   %% default
        end
    end
end
N = input('Input the no. of pts the fct is to be evaluated; N[10000]: ');
if isempty(N)
    N = 10000;                  %% default
end
% eqn = input(['Input the equation to use: (1) F(x)=ax^2+bx+c or ',...
%              '(2) F(x)=a(x-p)^2+b(x-p)+c, [1]: ']);
% if isempty(eqn)
eqn = 1;                        %% default
% end
%%% Based on the number of points to be used for the curve, find the
%%% x values to calculate and spread over the approximating function
x = linspace(x_low, x_high, N);
% The segments in this program do NOT overlap (i.e. the first element of
% the NEXT segment is NOT the last element of the LAST segment).
% Evaluate the function and place values in F
% (the statement itself is missing in the source scan; assumed form)
F = eval(vectorize(func));
% Print demarcation line
fprintf('\n ************************************************************')
fprintf('\n')
%%%%%%%%%%%%%%%%%%%%%%% Segmentation Algorithm %%%%%%%%%%%%%%%%%%%%%%%%
% REPEAT FOR ...
repeat = 1;
while repeat == 1
    if (mod(vari_or_const,2) == 1)
    % [The body of this block, which calls the non-uniform and uniform
    %  segmentation routines, is illegible in the source scan.]
    end


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Compute and plot function, approximate function and error
    ind = 1;                               % Index for each segment
    for i = 1:length(seg_end_point)
        m = 1;                             % Index within each segment
        XP = [];
        FP = [];
        while (ind < seg_end_point(i))
            XP(m) = x(ind);
            FNC(m) = F(ind);               % Actual function (Fct No correction)
            % [The statements that compute FP(m) and the error, and that
            %  advance m and ind, are illegible in the source scan.]
        end
        % Keep track of all errors
        if (mod(i,2) == 0)   % Plot every other segment a different color
            figure(mod(vari_or_const,2)+1)          %% Blue
            plot(XP,FP)
            figure(mod(vari_or_const,2)+3)          %% Blue
            plot(XP,Error)
        else
            figure(mod(vari_or_const,2)+1)
            plot(XP,FP,'r','LineWidth',2)           %% Red
            figure(mod(vari_or_const,2)+3)
            plot(XP,Error,'r','LineWidth',2)        %% Red
        end   % if (mod(i,2) == 0)
        figure(mod(vari_or_const,2)+1)
        hold on
        xlabel('x','FontSize',10)
        ylabel('f(x)','FontSize',10)
        if (mod(vari_or_const,2) == 1)
            title(['NON-UNIFORM f(x) segmentation. No. of segments = ',...
                num2str(length(seg_end_point)),'.'],'FontSize',10)
        elseif (mod(vari_or_const,2) == 0)
            title(['UNIFORM f(x) segmentation. No. of segments = ',...
                num2str(length(seg_end_point)),'.'],'FontSize',10)
        end
        figure(mod(vari_or_const,2)+3)
        hold on
        xlabel('x','FontSize',14)
        % Pick the maximum error from all the segments
        ylabel(['Error(x). Max Error = ',num2str(max(MaxError)),'.'],...
            'FontSize',10)
        if (mod(vari_or_const,2) == 1)
            title(['Error for NON-UNIFORM f(x) segmentation. No. of segs = ',...
                num2str(length(seg_end_point)),'.'],'FontSize',10)
        elseif (mod(vari_or_const,2) == 0)
            title(['Error for UNIFORM f(x) segmentation. No. of segs = ',...
                num2str(length(seg_end_point)),'.'],'FontSize',10)
        end
    end   % for i = 1:length(seg_end_point)
    figure(mod(vari_or_const,2)+1)
    plot(x,F)   % Plot function on same figure as piecewise approximation
    stem(x(seg_end_point),F(seg_end_point))
    hold off
    %%%%%%%%%%%%% Decimal to Binary Conversion Algorithm %%%%%%%%%%%%%%%%
    % Convert string end points, c_1 and c_0 into a binary string with one
    % integer bit and 8 fraction bits and print results table.
    if (mod(vari_or_const,2) == 1)
        fprintf('\n NON-UNIFORM Segmentation')
    elseif (mod(vari_or_const,2) == 0)
        fprintf('\n UNIFORM Segmentation')
    end
    if eqn == 1
        % Format pieces are concatenated with [] so that fprintf receives
        % a single format string.
        fprintf(['\n Segment  End Point  End Point  c_2 ',...
                 ' c_2  c_1  c_1  c_0 ',...
                 ' c_0'])
        fprintf(['\n Number  (Decimal)  (Binary)  (Decimal)',...
                 ' (Binary)  (Decimal)  (Binary)',...
                 ' (Decimal)  (Binary) '])
    end
    for i = 1:length(seg_end_point)
        xbin(i) = dec2binfp(x(seg_end_point(i)));
        segment(i+1) = x(seg_end_point(i));   % Used in next program
        c_2bin(i) = dec2binfp(c_2(i));
        c_1bin(i) = dec2binfp(c_1(i));
        c_0bin(i) = dec2binfp(c_0(i));
        if eqn == 1
            % Print Remaining Results Table
            fprintf(['\n %3d %8.6f %019.9f %10.5f %019.9f ',...
                     '%10.5f %019.9f %10.5f %019.9f '], i,...
                x(seg_end_point(i)), xbin(i), c_2(i), c_2bin(i), c_1(i),...
                c_1bin(i), c_0(i), c_0bin(i))
        end   % if eqn == 1
    end   % for i = 1:length(seg_end_point)


    % Create text file of Binary values to initialize memory
    memBin = [c_2bin .* 10^9; c_1bin .* 10^9; c_0bin .* 10^9];
    fid = fopen('memory.mem', 'w');
    fprintf(fid, '\n%018.0f%018.0f%018.0f', memBin);
    fclose(fid);
    % Create text file of Decimal Values to initialize memory
    fid = fopen('memDEC.mem', 'w');
    format long g;
    fprintf(fid, '%5d', length(seg_end_point));   % Number of Segments
    memDEC = [segment(2:end); c_2; c_1; c_0]
    fprintf(fid, '\n%18.12f %18.12f %18.12f %18.12f', memDEC);
    fclose(fid);
    % End text file creation
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if eqn == 2 
    %%%%%%%%%%%%%%%% The following created from: Extract_PL_Params.m
    %
    % This program extracts from the segmentation and the function, the
    %    1. Squared term coefficient
    %    2. Linear term coefficient
    %    3. Constant
    %
    % which are the parameters needed to store in the coefficients
    % memory. It produces the BINARY values of these parameters.
    %
    % The segmentation occurs as a vector of end points.
    %
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    fprintf('\n')
    fprintf('\n ************************************************************')
    % [A comment and an fprintf statement here are partly illegible in the
    %  source scan; surviving fragments: "Memory with", "segment.", "f(x)\n')".]
    fprintf('\n')
    segment(1) = 0;
    for i = 1:length(segment)
        seg_index(i) = floor(N*segment(i)/(x_high-x_low))+1;
    end   % for i = 1:length(segment)
    seg_index;
    for i = 2:length(segment)
        slope(i-1) = (F(seg_index(i)-1) - F(seg_index(i-1)))/...
                     (x(seg_index(i)-1) - x(seg_index(i-1)));
        intercept(i-1) = F(seg_index(i)-1) - slope(i-1)*x(seg_index(i)-1);
        a = max(F(seg_index(i-1):seg_index(i)-1)...
                - (slope(i-1).*x(seg_index(i-1):seg_index(i)-1)...
                + intercept(i-1)));
        b = min(F(seg_index(i-1):seg_index(i)-1)...
                - (slope(i-1).*x(seg_index(i-1):seg_index(i)-1)...
                + intercept(i-1)));
        error(i-1) = 0.5*(a + b);   % YES, it is a+b.
        % The last index below is cut off in the source scan; segment(i-1)
        % (the pivot, i.e. the start of the segment) is assumed.
        intercept(i-1) = intercept(i-1) + error(i-1) + slope(i-1)*segment(i-1);
        s_m_e(i-1) = segment(i) - segment(i-1);
        clx(i-1) = s_m_e(i-1)*slope(i-1);
        approx(i-1) = clx(i-1) + intercept(i-1);
        exact(i-1) = 2^segment(i);   % Exact value of f(x) at end of segment.
    end   % for i = 2:length(segment)
    fprintf('\nDECIMAL values for Approx = slope*(x - pivot) + intercept.')
    fprintf(['\nseg no. [s, e] slope intercept ',...
             'pivot approx_error e-s (e-s)*slope (e-s)',...
             '*slope+intercept exact f(x)\n'])
    for i = 1:length(segment)-1
        fprintf(['%1.0f [%8.6f %8.6f] %8.6f %8.6f %8.6f %8.6f ',...
                 '%8.6f %8.6f %8.6f %8.6f \n'], i-1, segment(i),...
            segment(i+1), slope(i), intercept(i), segment(i), error(i),...
            s_m_e(i), clx(i), approx(i), exact(i))
    end   % for i = 1:length(segment)-1
    % hold on
    % plot(x(1:N),slope(1).*x(1:N)+intercept(1))
    % Convert s, e, slope, intercept, and pivot to binary.
    fprintf('\nBINARY values')
    fprintf(['\nseg no. [s, e] slope intercept ',...
             'approx_error e-s (e-s)*slope (e-s)*sl+intercept\n'])
    for i = 1:length(segment)-1
        digits = ceil(log2(length(segment)-1));
        s_seg_no = dec2bin(i-1,digits);
        s_s(i) = dec2binfp(segment(i));
        s_e(i) = dec2binfp(segment(i+1));
        s_slope(i) = dec2binfp(slope(i));
        s_intercept(i) = dec2binfp(intercept(i));
        if error(i) < 0;
            error(i) = abs(error(i));
        end   % if error(i) < 0;
        s_error(i) = dec2binfp(error(i));
        s_s_m_e(i) = dec2binfp(s_m_e(i));
        s_clx(i) = dec2binfp(clx(i));
        s_approx(i) = dec2binfp(approx(i));
        s_exact(i) = dec2binfp(exact(i));
        fprintf(['%s [%10.8f %10.8f] %10.8f %10.8f %10.8f %10.8f ',...
                 '%10.8f %10.8f %10.8f \n'],...
            s_seg_no, s_s(i), s_e(i), s_slope(i), s_intercept(i),...
            s_error(i), s_s_m_e(i), s_clx(i), s_approx(i), s_exact(i))
    end   % for i
    end   % if eqn == 2
    fprintf('\n')
    fprintf('\n ************************************************************')
    fprintf('\n')
    % [The statements that update repeat and vari_or_const when both
    %  segmentations were requested (vari_or_const == 3) are illegible in
    %  the source scan.]
end   % while repeat == 1
% End file QuadAppxPfit.m


FILE: multipleQuadApprox.m

function [endpt,indx,c2,c1,c0] = multipleQuadApprox(x,fct,max_error)
% This function will produce multiple Quadratic-line approximations of a
% given function to within the bounds of max error provided.
%
% Created by Tom Mack for linear approximations
% Created: Mar 31, 2006
% Modified for Quadratic approximations by Njuguna Macaria
% Modified: Dec 30, 2006


FILE: varQuadApprox.m

function [endpt,i,c2,c1,c0] = varQuadApprox(x,fct,max_error,indx)
% This function creates a 2nd Order approximation of a given function
% using the polyfit function. It continues to calculate polyfits until
% maximum error is exceeded.
%
% Linear approximation Created by Tom Mack >> Mar 31, 2006
% Modified for Quadratic approximation by Njuguna Macaria
% Modified: Dec 29, 2006

for i=indx:length(fct);
    p = polyfit(x(indx:i),fct(indx:i),2);   % Fit equ to 2nd order poly
    c_2(i) = p(1);   % Coefficient of X^2
    c_1(i) = p(2);   % Coefficient of X
    c_0(i) = p(3);   % Intercept of polynomial
    approx(indx:i) = p(1)*(x(indx:i)).^2 + p(2)*x(indx:i) + p(3);
    errors = approx(indx:i) - fct(indx:i);
    % % maxposerror = max(errors);
    % % maxnegerror = min(errors);
    % % c_0delta(i) = abs((abs(maxposerror) - abs(maxnegerror))/2);
    % % % If the negative error is bigger, then the delta should be negative
    % % if abs(maxnegerror) > abs(maxposerror)
    % %     c_0delta(i) = -1 * c_0delta(i);
    % % end % if
    % % approx(indx:i) = approx(indx:i) - c_0delta(i);
    % % errors = approx(indx:i) - fct(indx:i);
    error = max(abs(errors));
    % If exceeded the max error, then go back to the previous endpoint
    if error > max_error
        i = i-1;
        endpt = x(i);
        c2 = c_2(i);
        c1 = c_1(i);
        c0 = c_0(i);
        return
    end   % if error > max
end   % for i=indx+1:length(fct)
endpt = x(i);
c2 = c_2(i);   % Removed i = i-1;
c1 = c_1(i);
c0 = c_0(i);


FILE: dec2binfp.m

function [binfp] = dec2binfp(x,n)
% Created by Tom Mack
% Last modified: August 22, 2006
%
% This function converts a decimal number to a fixed point binary number
% with one integer followed by n points to the right of the decimal
%
% Inputs
%    x = decimal number to be converted (does not have to be an integer)
%    n (optional, default 9) = bit resolution to the right and left of decimal pt
% Outputs
%    binfp = binary floating point representation
%    Negative inputs are output in 18-bit (9.9) format

if nargin < 2, n = 9; end
if isnan(x) == 1,
    binfp = NaN;   % right-hand side missing in the source scan; NaN assumed
    return
elseif x == Inf
    binfp = Inf;
    return
elseif x < 0,
    x = (x * 2^n) + 2^(2*n);
    x = dec2bin(x,18);
    x = str2double(x);
    x = x / 10^n;
    binfp = x;
    return
else
    x = x * 2^n;
    x = dec2bin(x,18);
    x = str2double(x);
    x = x / 10^n;
    binfp = x;
end


FILE: constQuadAppxWErr.m

function [endpt,indx,c2,c1,c0] = constQuadAppxWErr(x,fct,max_error)
% This function will produce multiple Quadratic-line approximations of a
% constant size of a given function to within the bounds of the
% max error provided. Coefficients & intercept calculated using polyfit.
% Intercept adjusted to balance max positive and negative errors.
%
% Created by Tom Mack for linear approximations
% Created: July 10, 2006
% Modified: July 11, 2006
% Modified again by Njuguna Macaria for Quadratic approximations
% Modified: Dec 30, 2006

firstderiv = diff(fct)./diff(x);
secndderiv = diff(firstderiv)./diff(x(1:length(firstderiv)));
[dermax,i] = max(abs(secndderiv));
error = 0;
loop_stop = 0;
i_low = i-1;
if i_low <= 0
    i_low = 1;
end
i_high = i+1;
if i_high > length(fct)
    i_high = length(fct);
end
% If error is too small, increase until just under the max error
% This gives the max size of the segment within the desired error
while error < max_error || loop_stop < length(fct)
    i_low = i_low - 1;
    if i_low <= 0
        i_low = 1;
    end
    i_high = i_high + 1;
    if i_high > length(fct)
        i_high = length(fct);
    end
    % Get coefficients, approximate function and find error
    % Adjust function based on the error (move it up or down)
    p = polyfit(x(i_low:i_high),fct(i_low:i_high),2);
    approx(i_low:i_high) = p(1)*(x(i_low:i_high)).^2 + ...
                           p(2)*x(i_low:i_high) + p(3);
    errors = approx(i_low:i_high) - fct(i_low:i_high);
    maxposerror = max(errors);
    maxnegerror = min(errors);
    c_0delta = abs((abs(maxposerror) - abs(maxnegerror))/2);
    % Figure out if the error is positive or negative and move the function
    % to compensate and balance the error of the approximated function
    if abs(maxnegerror) > abs(maxposerror)
        c_0delta = -1 * c_0delta;
    end   % if
    % Re-check the error and find the max error
    approx(i_low:i_high) = approx(i_low:i_high) - c_0delta;
    errors = approx(i_low:i_high) - fct(i_low:i_high);
    error = max(abs(errors));
    % If error is larger than should be
    if error > max_error
        i_low = i_low + 1;
        i_high = i_high - 1;
    end
    loop_stop = loop_stop + 1;
end
segsize = i_high - i_low;
consegs = ceil(length(fct)/segsize);

% Determine Coefficients of segments
idx=1;
for i = 1:consegs
    indx(i) = round((length(x)/consegs)*i);
    if indx(i) == 0
        indx(i) = 1;
    end
    if i==consegs
        indx(i) = length(x);
    end
    endpt(i) = x(indx(i));
    p = polyfit(x(idx:indx(i)),fct(idx:indx(i)),2);
    approx(idx:indx(i)) = p(1)*(x(idx:indx(i))).^2 + ...
                          p(2)*x(idx:indx(i)) + p(3);
    errors = approx(idx:indx(i)) - fct(idx:indx(i));
    maxposerror = max(errors);
    maxnegerror = min(errors);
    c_0delta = abs(abs(maxposerror) - abs(maxnegerror))/2;
    if abs(maxnegerror) > abs(maxposerror)
        c_0delta = -1 * c_0delta;
    end   % if
    c2(i) = p(1);
    c1(i) = p(2);
    c0(i) = p(3) - c_0delta;   % Constant shift to balance pos & neg error
    idx = indx(i)+1;
    i = i+1;
end


FILE: constantQuadApprox.m

function [endpt,indx,c2,c1,c0] = constantQuadApprox(x,fct,constsegs)
% This function will produce multiple Quadratic line approximations of a
% given function to within the bounds of the number of segments provided.
% Coefficients calculated by polyfit. Intercept adjusted to balance
% maximum positive and negative errors.
%
% Created by Tom Mack for linear approximations
% Created: June 4, 2006
% Modified for Quadratic approximations by Njuguna Macaria
% Modified: July 11, 2006

idx=1;
for i = 1:constsegs
    indx(i) = round((length(x)/constsegs)*i);
    if i==constsegs
        indx(i) = length(x);
    end
    endpt(i) = x(indx(i));
    p = polyfit(x(idx:indx(i)),fct(idx:indx(i)),2);
    approx(idx:indx(i)) = p(1)*(x(idx:indx(i))).^2+p(2)*x(idx:indx(i))+p(3);
    errors = approx(idx:indx(i)) - fct(idx:indx(i));
    maxposerror = max(errors);
    maxnegerror = min(errors);
    c_0delta = abs((abs(maxposerror) - abs(maxnegerror))/2);
    if abs(maxnegerror) > abs(maxposerror)
        c_0delta = -1 * c_0delta;
    end   % if
    c2(i) = p(1);
    c1(i) = p(2);
    c0(i) = p(3) - c_0delta;   % Intercept shift to balance pos & neg error
    idx = indx(i)+1;
    i = i+1;
end





A.2. QUADRATIC APPROXIMATION USING REMEZ ALGORITHM

The design in this thesis used the Remez algorithm. The following files were developed to compute the segmentation. The top level file is QuadAppxRemz.m, which calls a set of user-written MATLAB functions to display and request the user input (UserInput.m), obtain the numeric functions selected by the user and their respective domain intervals (getF.m), and then compute the segmentation.


Non-uniform segmentation was performed by multipleQuadApprox.m in conjunction with varQuadApproxHyb3AvgThird.m and chebyRemz.m. chebyRemz.m takes the place of polyfit, the optimized built-in MATLAB function used in A.1 above.


Uniform segmentation is performed by two other files. If the number of segments is known and ε is not given explicitly, constantQuadApprox.m is used. If, on the other hand, ε is defined and uniform segmentation is desired, constQuadAppxWErr.m is used.


The file twosComp.m was developed to convert the data to a two’s complement, fixed point binary, hexadecimal or decimal number. Note that the two’s complement decimal number is not the same as a float or double data type.
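
The distinction in that note can be made concrete with a short Python sketch. It assumes a hypothetical 16-bit word with 8 fraction bits; the function names are illustrative and are not those of twosComp.m.

```python
def encode_twos_complement(value, total_bits, frac_bits):
    """Return (binary string, hex string, unsigned decimal) for the
    two's complement fixed point word that represents value."""
    scaled = round(value * (1 << frac_bits))
    if not -(1 << (total_bits - 1)) <= scaled < (1 << (total_bits - 1)):
        raise OverflowError("value does not fit in the chosen format")
    word = scaled & ((1 << total_bits) - 1)
    hex_digits = (total_bits + 3) // 4
    return (format(word, "0{}b".format(total_bits)),
            format(word, "0{}X".format(hex_digits)),
            word)

def decode_twos_complement(word, total_bits, frac_bits):
    """Recover the real value from the unsigned decimal form of the word."""
    if word >= 1 << (total_bits - 1):
        word -= 1 << total_bits
    return word / (1 << frac_bits)

print(encode_twos_complement(-0.5, 16, 8))
print(decode_twos_complement(65408, 16, 8))
```

The decimal form of the word (65408 here) is just the bit pattern read as an unsigned integer, so it must be decoded before it can be compared with a float or double value such as -0.5.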
FILE: QuadAppxRemz.m

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% QuadAppxRemz.m
%
% Created:       January 6, 2007
% Created by:    Njuguna Macaria
% Last modified: August 3, 2007
% Modified by:   Njuguna Macaria
%
% This program produces a segmentation of a given function using either:
%   1. Uniform Quadratic approximation
%   2. Non-uniform piecewise Quadratic approximation
%   3. Both
%
% It is based on the algorithm:
%   1. For non-uniform, the MATLAB Remez algorithm
%   2. For uniform, dividing the range of the input into
%      equal, user-defined segments
%      or by using max error to determine max segment length
%      at the greatest curvature and then dividing the range
%      up into equal segments.
%
% Inputs
%   Inputs are taken from an input function; "userInput();"
%   N             - number of elements on which function is expressed
%   eqn           - (1) F(x)=ax^2+bx+c OR PIVOT: (2) F(x)=a(x-p)^2+b(x-p)+c
%   x_low         - low end of interval over which f(x) is evaluated
%   x_high        - high end of interval over which f(x) is evaluated
%   func(x)       - function to be evaluated
%   epsilon       - precision of approximation (for variable only)
%   consegs       - number of segments to use to approximate (constant only)
%   err_or_segs   - Constant segmentation; decide # of segments or err bound
%   vari_or_const - Variable or constant segmentation
% Outputs
%   Segment info  - Segment #, Begin Pt, End Pt, Coefficients, & Error
%   Plot showing the approximation
%   Text file used to initialize memory in SRC (both Binary & Decimal)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%% INPUT OF USER-SPECIFIED PARAMETERS %%%%%%%%%%%%%%%%%%%%%%

format long g;

% Get user input
% profile on   % For use when debugging. Find runtimes
sel = UserInput();
[f,interval,vari_or_const,err_or_segs,consegs,epsilon,N] = getF(sel);

%%% Based on the number of points to be used for the curve, find the
%%% x values to calculate and spread over the approximating function
eval(['func = ', f, ';'])
eval(['intv = ', interval, ';'])
x_pts = linspace(intv(1), intv(2), N);
vecFunc = inline(vectorize(func));   % Vectorized version of func.
y_actual = vecFunc(x_pts);           % Evaluate the function with x_pts








%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% NOTES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% The segments in this program overlap (i.e., the first element of
% the NEXT segment IS the last element of the LAST segment).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Print demarcation line
fprintf('\n**********************************************************************\n')

%%%%%%%%%%%%%%%%%%%%%%%%% Segmentation Algorithm %%%%%%%%%%%%%%%%%%%%%%%%%%%
repeat = 1;
while repeat == 1
    if (mod(vari_or_const,2) == 1)
        [endpt,seg_end_point,c_2,c_1,c_0] = ...
            multipleQuadApprox(x_pts,func,epsilon);
    end
    if (vari_or_const == 2) && (err_or_segs == 1)
        [endpt,seg_end_point,c_2,c_1,c_0] = ...
            constantQuadApprox(x_pts,vecFunc,consegs);
    end
    if ((vari_or_const == 2) && (err_or_segs == 2)) || (vari_or_const == 4)
        [endpt,seg_end_point,c_2,c_1,c_0] = ...
            constQuadAppxWErr(x_pts,func,epsilon);
    end

    fprintf('\n**********************************************************************\n')
    fprintf('\n\nBack from all the Segmentation\n\n')
    fprintf('\n**********************************************************************\n')
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % Compute and plot function, approximate function and error            %
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    for i = 1:length(seg_end_point)-1
        % Looking at each segment, find the approximate and actual points
        XP = x_pts(seg_end_point(i):seg_end_point(i+1));
        c = [c_2(i),c_1(i),c_0(i)];
        FNC = vecFunc(XP);
        FP = polyval(c,XP);
        Error = FP - FNC;
        MaxError(i) = max(abs(Error));
        if (mod(i,100)==0)   % Only used when trying to limit graphing
            if (mod(i,2) == 0)   % Plot every other segment a different color
                figure(mod(vari_or_const,2)+1)   % Blue
                plot(XP,FP)
                figure(mod(vari_or_const,2)+3)   % Blue
                plot(XP,Error)
            else
                figure(mod(vari_or_const,2)+1)
                plot(XP,FP,'r','LineWidth',2)      % Red
                figure(mod(vari_or_const,2)+3)
                plot(XP,Error,'r','LineWidth',2)   % Red
            end % if (mod(i,2) == 0)
            figure(mod(vari_or_const,2)+1)
            hold on
            xlabel('x','FontSize',10)
            ylabel('f(x)','FontSize',10)
            if (mod(vari_or_const,2) == 1)
                title(['NON-UNIFORM f(x)=',f,...
                    ' segmentation. No. of ',...
                    'segments = ',...
                    num2str(length(seg_end_point)-1),'.'],...
                    'FontSize',10)
            elseif (mod(vari_or_const,2) == 0)
                title(['UNIFORM f(x)=',f,...
                    'segmentation. No. of segments = ',...
                    num2str(length(seg_end_point)-1),'.'],...
                    'FontSize',10)
            end
            figure(mod(vari_or_const,2)+3)
            hold on
            xlabel('x','FontSize',14)
            errPwr2 = log2(max(MaxError));
            ylabel(['Max Error = ',num2str(max(MaxError)),' = 2\^',...
                num2str(errPwr2),'.'],'FontSize',10)
            if (mod(vari_or_const,2) == 1)
                title(['Error for NON-UNIFORM f(x)=',f,...
                    ' segmentation. No. of segs = ',...
                    num2str(length(seg_end_point)-1),'.'],...
                    'FontSize',10)
            elseif (mod(vari_or_const,2) == 0)
                title(['Error for UNIFORM f(x)=',f,...
                    ' segmentation. No. of segs = ',...
                    num2str(length(seg_end_point)-1),'.'],...
                    'FontSize',10)
            end
        end % if (mod(i,100)==0) Graphing STOP/START
    end % for i = 1:length(seg_endpt)
    figure(mod(vari_or_const,2)+1)
    plot(x_pts,y_actual)   % Plot func on same fig as piecewise approx
    stem(x_pts(seg_end_point),y_actual(seg_end_point))
    hold off

    %%%%%%%%%%%%%%%% Decimal to Binary Conversion Algorithm %%%%%%%%%%%%%%%%

    % Print whether Uniform or Non-uniform
    if (mod(vari_or_const,2) == 1)
        fprintf('\n NON-UNIFORM Segmentation')
    elseif (mod(vari_or_const,2) == 0)
        fprintf('\n UNIFORM Segmentation')
    end

    % Convert to Twos Complement (32.32)
    % and save in a file.
    fractLen = 32;              % 32 bits to represent the fraction
    intLen = 64-fractLen;       % 32 bits to represent the integer

    % Convert to Twos Complement (16.16)
    % and save in a file.
    fractLen = 16;              % 16 bits to represent the fraction
    intLen = 32 - fractLen;     % 16 bits to represent the integer

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%% BINARY FILE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % Create text file of Binary values to initialize memory


    fid = fopen('memBIN.mem', 'w');
    fprintf(fid,'%d', length(seg_end_point)-1);   % Number of Segments

    % Convert the values to binary and save in the file
    for i = 1:length(seg_end_point)-1
        xbin(i,:) = twosComp(x_pts(seg_end_point(i+1)),intLen,fractLen);
        segmnt(i) = x_pts(seg_end_point(i+1));   % Used in next program
        c_2bin(i,:) = twosComp(c_2(i),intLen,fractLen);
        c_1bin(i,:) = twosComp(c_1(i),intLen,fractLen);
        c_0bin(i,:) = twosComp(c_0(i),intLen,fractLen);
        memBin = [xbin(i,:),' ',c_2bin(i,:),' ',...
                  c_1bin(i,:),' ',c_0bin(i,:)];
        fprintf(fid,'\n%s',memBin);
        % fprintf(fid,'\n',xbin(i,:),' ',c_2bin(i,:),...
        %     ' ',c_1bin(i,:),' ',c_0bin(i,:));
    end % for i = 1:length(seg_end_point)
    fclose(fid);





    %%%%%%%%%%%%%%%%%%%%%%%%%% HEXADECIMAL FILE %%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % Create text file of Hexadecimal values to initialize memory
    fid = fopen('memHEXOx.mem', 'w');
    Num_of_Segments = length(seg_end_point)-1;
    fprintf(fid,'%6d', Num_of_Segments);   % Number of Segments

    % For uniform segmentation, store a step size
    if (vari_or_const == 2) || (vari_or_const == 4)
        step_len = Num_of_Segments/(intv(2) - intv(1));
        fprintf(fid,'\n0x%s', twosComp(step_len,intLen,fractLen));
    end

    % Convert the values to hexadecimal and save in the file
    for i = 1:length(seg_end_point)-1
        xbin(i,:) = twosComp(x_pts(seg_end_point(i+1)),intLen,fractLen);
        segmnt(i) = x_pts(seg_end_point(i+1));   % Used in next program
        c_2bin(i,:) = twosComp(c_2(i),intLen,fractLen);
        c_1bin(i,:) = twosComp(c_1(i),intLen,fractLen);
        c_0bin(i,:) = twosComp(c_0(i),intLen,fractLen);
        memBin = [['0x',xbin(i,:),' '],...
                  ['0x',c_2bin(i,:),' '],...
                  ['0x',c_1bin(i,:),' '],...
                  ['0x',c_0bin(i,:)]];
        fprintf(fid,'\n%s',memBin);
        % fprintf(fid,'\n',xbin(i,:),' ',c_2bin(i,:),...
        %     ' ',c_1bin(i,:));
    end % for i = 1:length(seg_end_point)
    fclose(fid);








    %%%%%%%%%%%%%%%%%%%%%%%%%%%%% DECIMAL FILE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % Create text file of Decimal Values to initialize memory
    fid = fopen('memDEC.mem', 'w');
    fprintf(fid,'%6d', Num_of_Segments);   % Number of Segments
    % For uniform segmentation, store a step size
    if (vari_or_const == 2) || (vari_or_const == 4)
        step_len = Num_of_Segments/(intv(2) - intv(1));
        fprintf(fid,'\n%26.18f', step_len);   % Step size in Decimal
    end
    memDEC = [segmnt(1:end); c_2; c_1; c_0]
    maxCoef = max(memDEC);
    minCoef = min(memDEC);
    fprintf(fid,'\n%26.18f %26.18f %26.18f %26.18f',memDEC);
    fclose(fid);
    % End text file creation

    fprintf('\n')
    fprintf('\n**********************************************************************\n')

    if vari_or_const ~= 3
        repeat = 0;
    end
    if vari_or_const == 3
        vari_or_const = 4;
    end

    % profile viewer
    pr = profile('info');
    profsave(pr,'profile_results')
end % End while repeat == 1
% maxCoef = max(maxCoef)   % for debugging to find number range
% minCoef = min(minCoef)

% End file: QuadAppxRemz.m
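For uniform segmentation the listing stores step_len = Num_of_Segments/(intv(2) - intv(1)), that is, the number of segments per unit of x. One plausible use of this value, sketched below in Python, is to turn an input x directly into a segment number with one subtraction and one multiplication. This use is an assumption made for illustration; the function name segment_index and the clamping at the ends are not taken from the thesis.

```python
def segment_index(x, x_low, x_high, n_segments):
    """Map x in [x_low, x_high] to a 0-based uniform segment number.

    Hypothetical helper: step_len is the same quantity the listing writes
    to the memory file for uniform segmentation.
    """
    step_len = n_segments / (x_high - x_low)    # segments per unit of x
    idx = int((x - x_low) * step_len)           # truncate toward zero
    # Clamp so that x == x_high falls in the last segment.
    return min(max(idx, 0), n_segments - 1)


if __name__ == "__main__":
    for x in (0.0, 0.26, 0.5, 1.0):
        print(x, segment_index(x, 0.0, 1.0, 4))
```

Because the segments are equal in width, no search over endpoints is needed, which is consistent with the simpler circuit reported for the uniform method.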
A.2.1 Remez Algorithm With Chebyshev Initial Points







































































FILE: chebyRemz.m

function [poly_coeff,oscil,snd_Err] = chebyRemz(fun,interval,order)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% chebyRemz.m
%
% Get chebyshev polynomial on the first iteration. Repeat for Remez
% application; user specifies the function to approximate.
% This program turns the function provided into an inline function.
%
% INPUT:
%   f:          function entered by user (want to approximate this)
%               However this function cannot be a constant. f must
%               be only one variable. Must use the variable 'x'.
%   order:      order of approximation, e.g. 2nd order polynomial
%   interval:   range over which the coefficients will be
%               approximated on the user's function.
% OUTPUT:
%   errRemz:    error points for the range given
%   poly_coeff: These are the coefficients of the polynomial that
%               approximates the function.
%   oscil:      Oscillations on interval, for second order poly, we
%               want only 2 oscillations. In this case oscillations
%               are the zeroes of the first derivative.
%
% Author: Njuguna Macaria
% Created: 20 February 2007          Last Modified: 26 MARCH 2007
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
a = interval(1);
b = interval(2);
N = 500;                      % Number of elements per segment
x_pts = linspace(a,b,N);      % X axis sample points
y_act = fun(x_pts);           % Evaluate actual function
eps = (-1).^[0:order+1];      % Epsilon for coefficients calculation
p_track = [];                 % For tracking result with error

% Estimate with Polyfit and get data
% % % % pp = polyfit(x_pts,y_act,order);   % get polyfit coefficients
% % % % y_pfit = polyval(pp,x_pts);        % evaluate with polyfit coefficients
% % % % errPfit = y_pfit - y_act;          % get polyfit error values to compare








% Repeat Powers of the polynomial in (order+2) rows and get the
% initial x points
set = ones(order+2,1)*([0:order+1]);
xi = (a+b)/2 + (b-a)/2*cos((set*pi)/(order+1));
j = 1;
max_loops = 10;               % Max loops for Remez function
ratio_error = 2;

% Entering conditions for the loop. First loop is the chebyshev polynomial
% Remez loop, however first set of coefficients are chebyshev coefficients
% Exit on these conditions: 1) Convergence 2) Greater than 9 iterations
%                           3) If we have an exact quadratic to approx...
% % while (ratio_error > 1.00000001 || ratio_error < 0.9999999) && j<max_loops
while j < max_loops
    % Extract set of initial points for evaluation (we'll use 4th column)
    % Next, evaluate the points on the actual function
    N_p = [xi(1,1); xi(1,2); xi(1,3); xi(1,4)];
    F = fun(N_p);

    % Raise x0, x1, x2, x3 to the respective powers
    A = (xi').^(set);
    A(:,4) = eps';

    % Find Polynomial Coefficients
    p = A\F;                          % 1st time = chebyshev coefficients
    p_track = [p_track,p];            % Records error

    % Remove err term; flip coefficients
    pflip = fliplr(p(1:end-1)');
    poly_coeff = pflip;

    % Calculate Plot Values
    y_apprx = polyval(pflip,x_pts);   % evaluate with poly coefficients

    % Calculate the Errors, break loop if
    %   1. function is already a Quadratic
    %   2. If convergence has been reached
    errRemz = y_apprx - y_act;
    max_Err = max(errRemz(2:end-1));  % Max error (exclude ends)
    min_Err = min(errRemz(2:end-1));  % Min error (exclude ends)
    if abs(max_Err) > abs(min_Err)    % Set the return value of error
        snd_Err = abs(max_Err);
    else
        snd_Err = abs(min_Err);
    end

    % (3) Exit loop if function == quadratic (very very small error)
    if abs(max_Err) < 2^-40 && abs(min_Err) < 2^-40
        oscil = 0;
        % % % plot_cheby(x_pts,y_apprx,y_act,y_pfit,errRemz,errPfit);
        break;   % if exact polynomial is found!!!
    end

    % (1) Exit loop on convergence (previous error equal to present)
    if j > 1
        comp1 = p_track(4,j);
        comp2 = p_track(4,j-1);
        if comp1 == comp2
            % plot_cheby(x_pts,y_apprx,y_act,y_pfit,errRemz,errPfit);
            break;
        end % if comp1 == comp2
    end % if j>1

    % Finding zeroes (Max & Min of error)
    err_der = diff(errRemz);          % Find difference between adjacent
    err_sign = sign(err_der);         % points and determine the signs.
    err_sign = diff(err_sign);        % Find difference between signs
    errZer1 = find(err_sign == -2);   % Yields either 2 or -2 where the
    errZer2 = find(err_sign == 2);    % original function changed sign
    errZeros = [errZer1,errZer2];     % Matrix of where sign changed

    % Exit Remez if too many Oscillations
    % Provide Chebyshev Coefficients.
    oscil = length(errZeros);
    if oscil > order
        fprintf('\n')
        warning('Too many oscillations; Chebyshev Coefficients provided.')
        break;
    end

    % Use max errors and replace x values
    new_x2 = find(errRemz == max_Err);   % Index of max error point
    new_x3 = find(errRemz == min_Err);   % Index of min error point

    % Make sure to replace into the correct order on the range
    new_x2 = new_x2(1);                  % In case there are multiple,
    new_x3 = new_x3(1);                  % pick the first element
    if new_x2 > new_x3
        xi(:,2) = a+new_x2/N*(b-a);
        xi(:,3) = a+new_x3/N*(b-a);
    elseif new_x2 < new_x3
        xi(:,2) = a+new_x3/N*(b-a);
        xi(:,3) = a+new_x2/N*(b-a);
    end % end if new_x2 > new_x3 statement

    ratio_error = abs(max_Err)/abs(min_Err);
    % ratio_err_track = [ratio_err_track,ratio_error];

    % Plot actual vs the approx functions
    % % if mod(j,3)==1 || j==max_loops
    % %     plot_cheby(x_pts,y_apprx,y_act,y_pfit,errRemz,errPfit);
    % %     figure
    % %     plot(x_pts,errFuncP)
    % % end % end if mod(j,3)==1 || j==max_loops statement

    % % trackj = [trackj, j];
    j = j+1;
end % while loop

format long;
% ratio_err_track
% p_track
% trackj
format short;
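The first pass of the loop in chebyRemz.m can be illustrated on its own. The Python sketch below (not the thesis code) places order+2 points at the Chebyshev extrema of the interval, as the listing does with xi, and solves for the polynomial coefficients together with an error term E whose sign alternates from point to point, which corresponds to the column A(:,4) = eps' in the listing. The helper names solve and cheby_start are assumptions made for this example, and a small Gaussian elimination stands in for MATLAB's backslash operator.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                      # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def cheby_start(f, a, b, order=2):
    """Return ([c0, ..., c_order], E) from one step at the Chebyshev points.

    Solves c0 + c1*x_k + ... + c_order*x_k**order + (-1)**k * E = f(x_k).
    """
    n = order + 2
    xs = [(a + b) / 2 + (b - a) / 2 * math.cos(k * math.pi / (order + 1))
          for k in range(n)]
    A = [[x ** p for p in range(order + 1)] + [(-1) ** k]
         for k, x in enumerate(xs)]
    sol = solve(A, [f(x) for x in xs])
    return sol[:-1], sol[-1]


if __name__ == "__main__":
    coeffs, err = cheby_start(lambda x: x ** 3, -1.0, 1.0)
    print(coeffs, err)
```

For f(x) = x^3 on [-1, 1] this single step already gives the minimax quadratic 0.75x with |E| = 0.25, because the Chebyshev extrema coincide with the alternation points in that case. For other functions the listing's loop moves the interior points to the observed error extrema and repeats.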
A.2.1 Variable Length Approximation Speed-Up Algorithms


The following files are the programs used to speed up the segmentation. Six are presented here. The first file is the one used for segmentation; the others are provided for comparison. Only the first file is complete. The other files show only the code that differs from the first one, i.e., the middle of the file, which searches for the segment width.
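The width estimate that these files start from, x_range = 4*(3*epsilon/|f'''|)^(1/3), is consistent with the standard minimax bound for a quadratic approximation when the third derivative is roughly constant over the segment: the error on a segment of width h is about |f'''| h^3 / 192, and solving for h at error epsilon gives the expression above. The Python sketch below illustrates this estimate and its conversion to a number of sample points; the function names are assumptions made for this example and do not appear in the thesis.

```python
def estimate_width(f3_abs, epsilon):
    """Width h for which |f'''| * h**3 / 192 equals epsilon.

    Equivalent to h = 4 * (3 * epsilon / |f'''|) ** (1/3), the form used
    in the listings. f3_abs must be non-zero.
    """
    return 4.0 * (3.0 * epsilon / f3_abs) ** (1.0 / 3.0)


def width_to_samples(h, x_range, n_pts):
    """Convert a width in x to a sample count, as len is computed in the listings."""
    return round(h / x_range * n_pts)


if __name__ == "__main__":
    # f(x) = x**3 has f''' = 6; an error of 0.25 is reached at width 2.
    h = estimate_width(6.0, 0.25)
    print(h, width_to_samples(h, 4.0, 1000))
```

The estimate only supplies a starting length. The listings then lengthen or shorten the segment by calling chebyRemz until the measured error brackets epsilon, which is why the quality of the first guess determines how many Remez calls are needed.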


a. Hybrid of 3 estimates, average and thirds 





FILE: varQuadApproxHyb3AvgThird.m 








function [endpt,i,p,data_] = ...
    varQuadApproxHyb3AvgThird(x_pts,f3der,est_max_len,fct,epsilon,indx)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% varQuadApproxHyb3AvgThird.m
%
% This function creates a 2nd order polynomial approximation of a given
% function using the Remez algorithm. It continues to calculate Remez
% approximations until epsilon is exceeded.
%
% Remez approximations (with first approximation being a Chebyshev
% polynomial approximation).
%
% To reduce the loop time, we first approximate the length of the
% proposed segment. We take 3 estimates, at the beginning, end and
% middle. Take the average of these 3. Then evaluate all the points
% on the proposed length and get a set of estimated lengths.
% Take the average of all these estimates. This is the proposed length
% to be used.
% INPUT:
%   fct:     function entered by user (want to approximate this)
%            However this function cannot be a constant. f must
%            be only one variable. Must use the variable 'x'.
%   x_pts:   All the x-axis points on which to evaluate the
%            function.
%   indx:    index at which to start the interval of x values
%   epsilon: maximum error that the user wants to limit the
%            approximated function.
% OUTPUT:
%   endpt:   end point of the segment
%   i:       Index at which we stopped the function approximated
%   p:       coefficient for polynomial approximation
%            p(1) is the x^2 coeff, p(2) is the x coeff and
%            p(3) is the constant term in the 2nd order poly
% Modified by Njuguna Macaria
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
syms x
order = 2;                            % Set the order of the polynomial
errStop = 0;                          % To stop if we exceeded epsilon
loopt = 1;                            % track times Remez is called
data_ = [];                           % Final loop count accumulated
x_ptsRange = x_pts(end)-x_pts(1);     % Basically (b-a)
start_interval = x_pts(indx);         % Start of this segment interval
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% ESTIMATION %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%% Using Average after 3 Est %%%%%%%%%%%%%%%%%%%%%%%%%%
abs_f3der = abs(f3der(start_interval));
if abs_f3der == 0
    len = round(.086*length(x_pts));  % Close, but ends up being increased
else
    x_range1 = 4*(epsilon*3/abs_f3der)^(1/3);
    len1 = round(x_range1/(x_ptsRange)*length(x_pts));
    if len1+indx > length(x_pts)
        len = length(x_pts) - indx;
    else
        abs_f3der = abs(f3der(x_pts(indx+len1)));
        if abs_f3der == 0
            len = round(.086*length(x_pts));
        else
            x_range2 = 4*(epsilon*3/abs_f3der)^(1/3);
            len2 = round(x_range2/(x_ptsRange)*length(x_pts));
            len_mid = round((len1+len2)/4);
            abs_f3der = abs(f3der(x_pts(indx+len_mid)));
            if abs_f3der == 0
                len = round(.086*length(x_pts));
            else
                x_range3 = 4*(epsilon*3/abs_f3der)^(1/3);
                len3 = round(x_range3/(x_ptsRange)*length(x_pts));
                len = round((len1+len2+len3)/3);
            end
        end
        if len+indx > length(x_pts)
            len = length(x_pts) - indx;
        end
        Der3Intr = f3der(x_pts(indx:indx+len));             % Get third derivatives
        AV3DER = mean(Der3Intr);                            % Average them all
        x_range = 4*(epsilon*3/abs(AV3DER))^(1/3);          % Get new x_range value
        len = round(x_range/(x_ptsRange)*length(x_pts));    % Best len
        if len+indx > length(x_pts)
            len = length(x_pts) - indx;
        elseif len > est_max_len*10   % When 3rd Derivative is small
            len = est_max_len;
        end
    end
end
interval = [start_interval, x_pts(indx+len)];
[p,oscil,errP] = chebyRemz(fct,interval,order);
max_Perr = errP;
LOOK = max_Perr/epsilon;
if LOOK < 0.9 || LOOK > 1.002
    % Find a good place to start indexing
    if abs_f3der == 0
        while (max_Perr > epsilon) && len > 2
            len = ceil(len/3);
            if len+indx > length(x_pts)
                len = length(x_pts) - indx;
                break;
            end
            interval = [start_interval, x_pts(indx+len)];
            [p,oscil,errP] = chebyRemz(fct,interval,order);
            max_Perr = errP;
            loopt = loopt + 1;
        end % while max_Perr > epsilon
        incrementLen = len;
    else
        incrementLen = ceil(len*.05);
    end % if abs_f3der
    while incrementLen > 2
        incrementLen = ceil(incrementLen/3);
        while (max_Perr < epsilon) && len > 2
            len = len + incrementLen;
            if len+indx > length(x_pts)
                len = length(x_pts) - indx;
                break;
            end
            interval = [start_interval, x_pts(indx+len)];
            [p,oscil,errP] = chebyRemz(fct,interval,order);
            max_Perr = errP;
            loopt = loopt + 1;
        end % while max_Perr > epsilon
        incrementLen = ceil(incrementLen/3);
        while (max_Perr > epsilon) && len > 2
            len = len - incrementLen;
            interval = [start_interval, x_pts(indx+len)];
            [p,oscil,errP] = chebyRemz(fct,interval,order);
            max_Perr = errP;
            loopt = loopt + 1;
            if incrementLen < 2
                break;
            end
        end % max_Perr > epsilon
    end % end while incrementLen > 2
end % if
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% PINPOINT %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Step from indx + len
if max_Perr > epsilon                 % Since we exceeded, go backwards
    i = indx+len;                     % Jump to the estimated length
    errStop = 2*epsilon;              % Increase to prevent premature stop
    while i < length(x_pts)
        if errStop < epsilon
            i = i+1;                  % This was the point evaluated before
            endpt = x_pts(i);         % the decrement at the end of this
                                      % while loop. Restore index i and all
                                      % associated data.
            fid = fopen('CompareLoop.txt','a');
            data_ = [data_ loopt];
            Der3Intr = f3der(x_pts(indx:indx+len));
            AV3DER = mean(Der3Intr);
            fprintf(fid,['\n%4d %4d len: %5d i: %5d ',...
                'avg:%10.5f LOOK: %8.6f MORE'],...
                i,loopt,len,i-indx,AV3DER,LOOK);
            fclose(fid);
            return
        end
        loopt = loopt + 1;
        interval = [start_interval, x_pts(i)];
        [p,oscil,errP] = chebyRemz(fct,interval,order);
        errStop = errP;
        i = i - 1;
    end
else
    for i = indx+len:length(x_pts)    % Since we were short, go forward
        % First time thru, skip this if statement
        % If exceeded the max error, then go back to the previous endpoint
        if errStop > epsilon
            i = i-2;                  % Get back to within Error
            endpt = x_pts(i);
            interval = [start_interval, x_pts(i)];
            [p,oscil,errP] = chebyRemz(fct,interval,order);
            fid = fopen('CompareLoop.txt','a');
            data_ = [data_ loopt];
            Der3Intr = f3der(x_pts(indx:indx+len));
            AV3DER = mean(Der3Intr);
            fprintf(fid,['\n%4d %4d len: %5d i: %5d ',...
                'avg:%10.5f LOOK: %8.6f LESS'],...
                i,loopt,len,i-indx,AV3DER,LOOK);
            fclose(fid);
            return
        end % if error > max
        loopt = loopt + 1;
        interval = [start_interval, x_pts(i)];
        [p,oscil,errP] = chebyRemz(fct,interval,order);
        errStop = errP*1.05;          % reduces the iterations
    end
end % max_Perr > epsilon...... % for i=indx+1:length(fct)
fid = fopen('CompareLoop.txt','a');
data_ = [data_ loopt];
fprintf(fid,'\n%4d %4d',i,loopt);
fclose(fid);
endpt = x_pts(i);

% END OF FILE: varQuadApproxHyb3AvgThird.m
b. Binary Search

FILE: varQuadApproxBinSearch.m

[Listing not legible in the source. The recoverable statements are: a loop while incrementLen > 2 in which the step is halved with incrementLen = round(incrementLen/2); inner loops that test max_Perr against epsilon with the guard len > 1; the bound check if len+indx > length(x_pts), which sets len = length(x_pts) - indx and breaks; and, in each loop, interval = [start_interval, x_pts(indx+len)] followed by [p,oscil,errP] = chebyRemz(fct,interval,order), max_Perr = errP and an increment of loopt.]
c. Thirds

FILE: varQuadApproxTHIRD.m

[Listing not legible in the source. The recoverable statements are: a loop while incrementLen > 2 in which the step is cut to a third with incrementLen = round(incrementLen/3); inner loops that test max_Perr against epsilon with the guard len > 2; the exit test if incrementLen < 3, break; the bound check if len+indx > length(x_pts); and, in each loop, interval = [start_interval, x_pts(indx+len)] followed by [p,oscil,errP] = chebyRemz(fct,interval,order), max_Perr = errP and an increment of loopt.]
FILE: varQuadApproxRatio.m

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% RATIOS %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
len = length(x_pts)-indx;
max_Perr = 100;
LOOK = 0;
% Find a good place to start indexing
while (max_Perr > epsilon) && len > 2
    len = floor(len/3);
    interval = [start_interval, x_pts(indx+len)];
    [p,oscil,errP] = chebyRemz(fct,interval,order);
    max_Perr = errP;
    loopt = loopt + 1;
end
while (max_Perr < epsilon) && len > 2
    len = ceil(len*1.2);
    if len+indx > length(x_pts)
        len = length(x_pts) - indx;
        break;
    end
    interval = [start_interval, x_pts(indx+len)];
    [p,oscil,errP] = chebyRemz(fct,interval,order);
    max_Perr = errP;
    loopt = loopt + 1;
end % max_Perr > epsilon
while (max_Perr > epsilon) && len > 2
    len = floor(len*.95);
    interval = [start_interval, x_pts(indx+len)];
    [p,oscil,errP] = chebyRemz(fct,interval,order);
    max_Perr = errP;
    loopt = loopt + 1;
end % max_Perr > epsilon
while (max_Perr < epsilon) && len > 2
    len = ceil(len*1.01);
    if len+indx > length(x_pts)
        len = length(x_pts) - indx;
        break;
    end
    interval = [start_interval, x_pts(indx+len)];
    [p,oscil,errP] = chebyRemz(fct,interval,order);
    max_Perr = errP;
    loopt = loopt + 1;
end % while max_Perr > epsilon
while (max_Perr > epsilon) && len > 2
    len = floor(len*.999);
    interval = [start_interval, x_pts(indx+len)];
    [p,oscil,errP] = chebyRemz(fct,interval,order);
    max_Perr = errP;
    loopt = loopt + 1;
end % while max_Perr > epsilon
% [A further loop with the bound check on len+indx follows in the source;
%  its step factor is not legible.]
% Estimate the segment length from the third derivative at the start point
abs_f3der = abs(f3der(start_interval));
if abs_f3der == 0
    len = round(.086*length(x_pts));  % Close, but ends up being increased
else
    x_range1 = 4*(epsilon*3/abs_f3der)^(1/3);
    len = round(x_range1/(x_ptsRange)*length(x_pts));
end
if len+indx > length(x_pts)
    len = length(x_pts) - indx;
end
interval = [start_interval, x_pts(len+indx)];
[p,oscil,errP] = chebyRemz(fct,interval,order);
max_Perr = errP;
2 estimates

[Listing only partly legible in the source. The recoverable statements, in order, are:
abs_f3der = abs(f3der(start_interval));
x_range1 = 4*(epsilon*3/abs_f3der)^(1/3);
len1 = round(x_range1/(x_ptsRange)*length(x_pts));
if len1+indx > length(x_pts), len = length(x_pts) - indx;
abs_f3der = abs(f3der(x_pts(indx+len1)));
x_range2 = 4*(epsilon*3/abs_f3der)^(1/3);
len2 = round(x_range2/(x_ptsRange)*length(x_pts));
len = round((len1+len2)/2);
if len+indx > length(x_pts) ... end
interval = [start_interval, x_pts(len+indx)];
[p,oscil,errP] = chebyRemz(fct,interval,order);
max_Perr = errP;
The branches taken when abs_f3der == 0 refer to est_max_len and carry the comment "Close, but ends up being increased", but their exact form is not legible.]
[Listing only partly legible in the source. It takes three estimates of the segment length. The recoverable statements, in order, are:
abs_f3der = abs(f3der(start_interval));
x_range1 = 4*(epsilon*3/abs_f3der)^(1/3);
len1 = round(x_range1/(x_ptsRange)*length(x_pts));
if len1+indx > length(x_pts), len = length(x_pts) - indx;
abs_f3der = abs(f3der(x_pts(indx+len1)));
x_range2 = 4*(epsilon*3/abs_f3der)^(1/3);
len2 = round(x_range2/(x_ptsRange)*length(x_pts));
len_mid = round((len1+len2)/4);
abs_f3der = abs(f3der(x_pts(indx+len_mid)));
x_range3 = 4*(epsilon*3/abs_f3der)^(1/3);
len3 = round(x_range3/(x_ptsRange)*length(x_pts));
len = round((len1+len2+len3)/3);
if len+indx > length(x_pts) ... elseif len > est_max_len*10 (when 3rd derivative is small), len = est_max_len;
interval = [start_interval, x_pts(len+indx)];
[p,oscil,errP] = chebyRemz(fct,interval,order);
max_Perr = errP;
The branches taken when abs_f3der == 0 refer to est_max_len and carry the comment "Close, but ends up being increased", but their exact form is not legible.]
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A.2.2. Non-Uniform Quadratic Approximation 


This file keeps track of the segments computed and their associated endpoints and coefficients. The data is sent back to the main function, QuadAppxRemz.m. From here we call varQuadApproxHyb3AvgThird.m or any of the other varQuadApprox* files, depending on which one we want to use.


FILE: multipleQuadApprox.m

function [endpt,indx,c2,c1,c0] = multipleQuadApprox(xpts,fct,epsilon)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This function will produce multiple Quadratic-line approximations of a
% given function to within the bounds of max error provided.
%
% Created: January, 2007
% INPUT:
%   fct:     function entered by user (want to approximate this)
%            However this function cannot be a constant. f must
%            be only one variable. Must use the variable 'x'.
%   xpts:    All the x-axis points on which to evaluate the
%            function.
%   epsilon: maximum error that the user wants to limit the
%            approximated function.
% OUTPUT:
%   endpt:   end point of the segment
%   indx:    Array of all the index endpoints
%   c2:      Array of the x^2 polynomial coefficients
%   c1:      Array of the x polynomial coefficients
%   c0:      Array of the constant terms in the 2nd order poly
% Modified: July 2, 2007
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
syms x
format compact
i = 1;
seg_no = 1;
endpt = [];
c2 = [];
c1 = [];
c0 = [];
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Find Max length Estimate. Will be         %
% used if third derivative = 0, or if       %
% it's really small (NOT YET IMPLEMENTED)   %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
fct_vec = inline(vectorize(fct));
abs_f3der = abs(diff(diff(diff(fct))));
abs_f3der_vec = inline(vectorize(abs_f3der));
f3der_pts = abs(abs_f3der_vec(xpts));
abs_f3der_max = max(f3der_pts);                    % Absolute Max 3rd derivative
x_ptsRange = xpts(end)-xpts(1);
xpts_min_seg = 4*(epsilon*3/abs_f3der_max)^(1/3);  % smallest seg width
min_seg_len = round(xpts_min_seg/x_ptsRange*length(xpts));
xpts_avg_seg = 4*(epsilon*3*x_ptsRange/...
    quadl(abs_f3der_vec,xpts(1),xpts(end)))^(1/3);
avg_seg_len = round(xpts_avg_seg/x_ptsRange*length(xpts));
est_max_len = 2*avg_seg_len - min_seg_len;
% If the function is sqrt(-log(x)), then make est_max_len the max size.
% est_max_len calculated is not as large as the larger segments and will
% slow down the program because of small estimates...Therefore:
if fct == sqrt(-log(x))
    est_max_len = length(xpts);
end
% Sometimes the estimates are short. To prevent this from affecting the
% program... est_max_len is increased * 10
% est_max_len = 10*est_max_len;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Get the values for each segment and       %
% store them in the return vectors          %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
indx(i) = 1;   % To include the first element, offset length by 1
while i < length(xpts)
    [endpt(seg_no),indx(seg_no+1),polyCoeff] = ...
        varQuadApproxHyb3AvgThird(xpts,abs_f3der_vec,...
        est_max_len,fct_vec,epsilon,i);
    c2(seg_no) = polyCoeff(1);
    c1(seg_no) = polyCoeff(2);
    c0(seg_no) = polyCoeff(3);
    i = indx(seg_no+1);
    seg_no = seg_no + 1;
end
fprintf('\n********** End of Segmentation **********\n');
avg_seg_len
min_seg_len
est_max_len
Seg_lengths = diff(indx)
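
The segment-width estimate in multipleQuadApprox.m is consistent with the standard Chebyshev error bound for a best quadratic fit: on a segment of width w the error is approximately |f'''| w^3 / 192, and solving for w at a target error epsilon gives w = 4 (3 epsilon / |f'''|)^(1/3). The Python sketch below is an illustration only and is not part of the thesis code; the function names are hypothetical. It evaluates the estimate and converts it to a number of sample points in the way min_seg_len is computed.

```python
import math

def est_segment_width(epsilon, abs_f3der):
    """Width w at which a best quadratic fit has error about epsilon,
    obtained from epsilon ~= abs_f3der * w**3 / 192."""
    return 4.0 * (3.0 * epsilon / abs_f3der) ** (1.0 / 3.0)

def est_segment_points(n_points, x_range, epsilon, abs_f3der_max):
    """Smallest segment length expressed in sample points (cf. min_seg_len)."""
    width = est_segment_width(epsilon, abs_f3der_max)
    # MATLAB round() rounds halves away from zero; floor(v + 0.5) agrees
    # with it for the positive values that occur here.
    return int(math.floor(width / x_range * n_points + 0.5))

# Sanity check: x**3 has third derivative 6, and its best quadratic fit on
# [-1, 1] (width 2) has maximum error 1/4, so the estimate should be 2.
w = est_segment_width(0.25, 6.0)

# 2**x on [0, 1]: |f'''| is largest at x = 1, where it equals 2*ln(2)**3.
n = est_segment_points(1000000, 1.0, 2.0 ** -16, 2.0 * math.log(2.0) ** 3)
```

With one million sample points and epsilon = 2^-16, the smallest segment for 2^x comes out at roughly 16% of the domain, which is why only a handful of segments are needed at that accuracy.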
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A.2.3 Uniform Quadratic Approximation 





FILE: constantQuadApprox.m

function [endpt,indx,c2,c1,c0] = constantQuadApprox(x_pts,fct,constsegs)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This function produces multiple Quadratic approximations of a
% given function to within the bounds of the number of segments provided.
% Coefficients calculated by Remez.
%
% Created by Tom Mack for linear approximations, using polyfit
% Created: June 4, 2006
%
% Modified for Quadratic approximations using Remez by Njuguna Macaria
% Modified: July 11, 2006
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
syms x
order = 2;
indx(1) = 1;
for i = 1:constsegs
    indx(i+1) = round((length(x_pts)/constsegs)*i);  % each iteration set seg size
    if i==constsegs
        indx(i+1) = length(x_pts);
    end
    endpt(i) = x_pts(indx(i+1));
    interval = [x_pts(indx(i)),endpt(i)];
    [p,oscil,errP] = chebyRemz(fct,interval,order);
    c2(i) = p(1);
    c1(i) = p(2);
    c0(i) = p(3);
    i = i+1;   % no effect: the for loop reassigns i on each pass
end
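
The index arithmetic in constantQuadApprox.m can be checked in isolation. The following Python sketch is an illustration under our own naming (uniform_boundaries is hypothetical, not a thesis function); it reproduces only the boundary computation, not the call to chebyRemz.

```python
import math

def uniform_boundaries(n_points, n_segs):
    """1-based boundary indices of n_segs near-equal segments."""
    indx = [1]
    for i in range(1, n_segs + 1):
        # floor(v + 0.5) reproduces MATLAB round() for positive v
        indx.append(int(math.floor(n_points / n_segs * i + 0.5)))
    indx[-1] = n_points  # the last segment always ends on the last point
    return indx

bounds = uniform_boundaries(10, 3)
# Consecutive segments share an endpoint, as in the listing, where each
# interval runs from x_pts(indx(i)) to x_pts(indx(i+1)).
segments = list(zip(bounds[:-1], bounds[1:]))
```

For 10 points and 3 segments this yields the boundaries 1, 3, 7, 10, so the segment lengths differ by at most one rounding step.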
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A.2.4 Uniform Quadratic Approximation with Constraints 








FILE: constQuadAppxWErr.m

function [endpt,indx,c2,c1,c0] = constQuadAppxWErr(xpts,fct,epsilon)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This function will produce multiple Quadratic-line approximations of
% constant size of a given function to within the bounds of the max error
% provided. Coefficients & intercept calculated using Chebychev and Remez
% algorithm.
%
% INPUT:
%   fct:     function entered by user (want to approximate this)
%            However this function cannot be a constant. f must
%            be only one variable. Must use the variable 'x'.
%   xpts:    All the x-axis points on which to evaluate the
%            function.
%   indx:    index at which to start the interval of x values
%   epsilon: maximum error that the user wants to limit the
%            approximated function.
% OUTPUT:
%   endpt:   end point of the segment
%   c2:      Coefficients of x^2 in quadratic polynomial
%   c1:      Coefficients of x in quadratic polynomial
%   c0:      Constant of quadratic polynomial
%
% Njuguna Macaria   Date:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Compute # of seg
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Find Min length Estimate. Will be         %
% the limiting length for uniform           %
% implementation                            %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
fct_vec = inline(vectorize(fct));                  % vectorize fct (for eval)
abs_f3der = abs(diff(diff(diff(fct))));            % symbolic 3rd derivative
abs_f3der_vec = inline(vectorize(abs_f3der));      % vectorize for evaluation
f3der_pts = abs(abs_f3der_vec(xpts));              % evaluate to form vector
abs_f3der_max = max(f3der_pts);                    % abs(Max 3rd derivative)
x_ptsRange = xpts(end)-xpts(1);                    % Find length of x-domain
xpts_min_seg = 4*(epsilon*3/abs_f3der_max)^(1/3);  % smallest domain len
est_min_seglen = floor(xpts_min_seg/x_ptsRange*length(xpts))  % in index pts








%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Find where this happens in the domain     %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
IndxofMax = find(f3der_pts == abs_f3der_max);  % Find min len on domain
numoftimes = length(IndxofMax);                % How many are there?


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Test begin, End and Middle                %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
est_max_len = length(xpts);                   % dummy variable
if numoftimes > 1                             % more than 1 Max point
    i = IndxofMax(1);                         % i at begin of est seg
else
    i = IndxofMax;                            % default
end
IndMaxtmp = i;                                % The new IndexofMax
if i > length(xpts) - est_min_seglen          % Check if truncated
    lenMid   = est_min_seglen;                % segment then fix
    lenBegin = est_min_seglen;                % both these estimates
else
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % Begin with the index of the highest 3rd derivative
    [endpt,indx,p] = varQuadApproxHyb3AvgThird(xpts,...
        abs_f3der_vec,...
        est_max_len,...
        fct_vec,...
        epsilon,...
        i);
    lenBegin = indx - i;
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % index of the highest 3rd derivative in the middle
    i = IndMaxtmp - floor(est_min_seglen/2);  % i is at end of est seg
    if i < 1        % Check to make sure not indexing before begin
        i = 1;      % of interval, if so start at begin of interval
    end
    [endpt,indx,p] = varQuadApproxHyb3AvgThird(xpts,...
        abs_f3der_vec,...
        est_max_len,...
        fct_vec,...
        epsilon,...
        i);
    lenMid = indx - i;
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% End with the index of the highest 3rd derivative
i = IndMaxtmp - est_min_seglen;               % i is at end of est seg
if i < 1            % Check to make sure not indexing before begin
    i = 1;          % of interval, if so start at begin of interval
end
[endpt,indx,p] = varQuadApproxHyb3AvgThird(xpts,...
    abs_f3der_vec,...
    est_max_len,...
    fct_vec,...
    epsilon,...
    i);
lenEnd = indx - i;
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% Find of required segments on domain % 








ole 




















min_seglen = min([lenBegin,lenMid,lenEnd,est_min_seglen]);
numberOfSegs = ceil(length(xpts)/(min_seglen-1));  % Go large, figure # segs
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Reuse the function to calculate data      %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[endpt,indx,c2,c1,c0] = constantQuadApprox(xpts,fct_vec,numberOfSegs);
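
The final step of constQuadAppxWErr.m turns the shortest segment length into a segment count. The Python sketch below isolates that step; it is an illustration with a hypothetical function name, and the explanation of the "minus one" in the comment is our reading of the listing, not something the thesis states.

```python
import math

def uniform_segment_count(n_points, seg_lens):
    """Segments needed when every uniform segment is as short as the
    shortest entry of seg_lens (all lengths in sample points)."""
    min_seglen = min(seg_lens)
    # The listing divides by (min_seglen - 1). One reading: adjacent
    # segments share an endpoint, so each advances by one point less than
    # its length. Rounding up keeps the error within the bound.
    return math.ceil(n_points / (min_seglen - 1))

# Hypothetical probe results: begin, middle, end, and the analytic estimate.
count = uniform_segment_count(1000, [101, 140, 125, 110])
```

Because the count is driven by the single worst-case length, the uniform method needs more segments than the non-uniform one wherever the third derivative varies strongly across the domain.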
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A.2.5 Fixed-Point Decimal to HEXADECIMAL or BINARY 


FILE: twosComp.m

%function [hexX,decX,binX] = twosComp(x,intLen,mantisaLen)
function hexX = twosComp(x,intLen,mantisaLen)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% twosComp.m
%
% This function converts any decimal number to a two's complement binary
% fi object.
%
% function [hexX,decX,binX] = twosComp(x,intLen,mantisaLen)
% Input:  x:          The value to be converted
%         intLen:     User desired length of the integer portion of
%                     the number. How many bits are in the integer.
%         mantisaLen: The length of the mantissa. The number of bits
%                     in the fraction section, the precision.
% Output: decX:       Decimal value as fi object. Integer and
%                     fraction as decimal representation.
%         binX:       Two's Complement of the input x. With integer
%                     portion represented with "intLen" bits and the
%                     fraction portion represented with "mantisaLen"
%                     bits.
%         hexX:       Two's Complement of the input x. Represented
%                     as a Hexadecimal value.
%
% This function auto-aligns the decimal point.
%
% Created by: Njuguna Macaria
% Date: 10 May 2007
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
totalLen = intLen+mantisaLen;  % Total bits desired to represent the nbr.
if totalLen > 128
    warning('Max Precision: 128bits. You have requested > 128 bits');
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% fi Object: two's complement   %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
decX = fi(x,1,totalLen,mantisaLen);  % Create fi object, display decimal
binX = bin(decX);                    % Save and return a binary form
hexX = hex(decX);
deciM = dec(decX);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Quantizer: two's complement   %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% % q = quantizer('fixed','nearest','saturate',[totalLen mantisaLen])
% % [a,b] = range(q)
% % binX = num2bin(q,x)
% % decX = bin2num(q,b)
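
For readers without the MATLAB fixed-point tools, the conversion performed by twosComp.m can be approximated in a few lines. The Python sketch below is an illustration only: twos_comp_hex is a hypothetical name, and it assumes round-to-nearest with saturation on overflow, which we believe matches the default behavior of a signed fi object but have not verified against every case.

```python
import math

def twos_comp_hex(x, int_len, frac_len):
    """Hex string of x as a signed fixed-point word with int_len integer
    bits (sign bit included) and frac_len fraction bits."""
    total_len = int_len + frac_len
    # Scale to an integer and round to nearest (ties toward +infinity).
    scaled = math.floor(x * (1 << frac_len) + 0.5)
    # Saturate to the representable signed range.
    lo = -(1 << (total_len - 1))
    hi = (1 << (total_len - 1)) - 1
    scaled = max(lo, min(hi, scaled))
    # Masking maps negative values onto their two's complement pattern.
    word = scaled & ((1 << total_len) - 1)
    return format(word, "0{}x".format((total_len + 3) // 4))

neg = twos_comp_hex(-1.5, 16, 16)      # 16 integer bits, 16 fraction bits
one = twos_comp_hex(1.0, 16, 16)
sat = twos_comp_hex(100000.0, 16, 16)  # too large: clamps to the maximum
```

With the 16.16 format used by the multiplier in Appendix B, -1.5 becomes fffe8000 and 1.0 becomes 00010000.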
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A.2.6 User Interface and Function Information Files 

































































FILE: UserInput.m

function select = UserInput()
format long g
fprintf('\n\n')
fprintf('**********************************************************************')
fprintf('\n\n')
fprintf('\n QUADRATIC APPROXIMATION OF A FUNCTION USING CHEBYSHEV')
fprintf('\n AND REMEZ ALGORITHM')
fprintf('\n')
fprintf('\n')
disp('**********************************************************************')
disp('Functions to be compared                            Interval')
disp('  1. 2^x                                            [0,1]')
disp('  2. 1/x                                            [1,2]')
disp('  3. sqrt(x)                                        [1,2]')
disp('  4. 1/sqrt(x)                                      [1,2]')
disp('  5. log2(x)                                        [1,2]')
disp('  6. log(x) = ln(x)                                 [1,2]')
disp('  7. sin(pi*x)                                      [0,1/2]')
disp('  8. cos(pi*x)                                      [0,1/2]')
disp('  9. tan(pi*x)                                      [0,1/4]')
disp(' 10. sqrt(-log(x)) = sqrt(-ln(x))                   [1/512,1/4]')
disp(' 11. tan(pi*x)^2 + 1                                [0,1/4]')
disp(' 12. -(x*log2(x) + (1-x)*log2(1-x))                 [1/256,1-1/256]')
disp(' 13. 1/(1+exp(-x)) = 1/(1+e^(-x))                   [0,1]')
disp(' 14. (1/sqrt(2*pi))*exp(-x^2/2)                     [0,sqrt(2)]')
disp(' 15. sin(exp(x))                                    [0,2]')
disp('**********************************************************************')

% Get FUNCTION to be approximated (user input)
select = input('Input the Function, func[sqrt(-1*log(x))]: ');
if isempty(select)
    select = 10;   % default
end
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FILE: 





getF.m

function [func,interval,vari_or_const,err_or_segs,consegs,epsilon,N] = ...
    getF(fnc_choice)
syms x
interval = '[1/256,1/4]';   % default
err_or_segs = 0;            % default
consegs = 200;              % default
epsilon = 0.0001;           % default
switch fnc_choice
    case 1
        func = '2^x';
        interval = '[0,1]';
    case 2
        func = '1/x';
        interval = '[1,2]';
    case 3
        func = 'sqrt(x)';
        interval = '[1,2]';
    case 4
        func = '1/sqrt(x)';
        interval = '[1,2]';
    case 5
        func = 'log2(x)';
        interval = '[1,2]';
    case 6
        func = 'log(x)';
        interval = '[1,2]';
    case 7
        func = 'sin(pi*x)';
        interval = '[0,1/2]';
    case 8
        func = 'cos(pi*x)';
        interval = '[0,1/2]';
    case 9
        func = 'tan(pi*x)';
        interval = '[0,1/4]';
    case 10
        func = 'sqrt(-log(x))';
        interval = '[1/512,1/4]';
    case 11
        func = 'tan(pi*x).^2 + 1';
        interval = '[0,1/4]';
    case 12
        func = '-(x*log2(x) + (1-x)*log2(1-x))';
        interval = '[1/256,1-1/256]';
    case 13
        func = '1/(1+exp(-x))';
        interval = '[0,1]';
    case 14
        func = '(1/sqrt(2*pi))*exp(-x^2/2)';
        interval = '[0,sqrt(2)]';
    case 15
        func = 'sin(exp(x))';
        interval = '[0,2]';
end %switch fnc_choice

% Get CONSTANT OR VARIABLE segmentation (User input)
vari_or_const = 0;
while vari_or_const ~= 1 && vari_or_const ~= 2 && vari_or_const ~= 3
    vari_or_const = input('(1)Non-uniform (2)Uniform Segmentation [1]: ');
    if isempty(vari_or_const)
        vari_or_const = 1;   % default Non-uniform
    end
end

% If non-uniform segmentation, then enter ERROR parameters
if vari_or_const ~= 2
    epsilon = input('Input the Desired Error, epsilon[2^-33]: ');
    if isempty(epsilon)
        epsilon = 2^-33;   % default
    end
end

% If uniform segmentation, find how the user will restrict # of segments
if vari_or_const == 2
    err_or_segs = input('Constrain by (1)Number of Segments or (2)Error [1]: ');
    if isempty(err_or_segs)
        err_or_segs = 1;   % default
    end
    if err_or_segs == 1
        consegs = input('Input the number of Desired Segments[20]: ');
        if isempty(consegs)
            consegs = 20;   % default
        end
    end
    if err_or_segs == 2
        epsilon = input('Input the given_error; epsilon[2^-16]: ');
        if isempty(epsilon)
            epsilon = 2^-16;   % default
        end
    end
end
N = input('Input the no. of pts the fct is to be evaluated, N[1000000]: ');
if isempty(N)
    N = 1000000;   % default
end
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APPENDIX B. HDL CODE 


B.1 MULTIPLIER CODE
The VHDL code was adapted from Xilinx's application note on pipelining a multiplier in the Virtex-II family of chips [22]. The code takes two 32-bit inputs and produces one 32-bit product with the binary point in the middle: a 16-bit integer part and a 16-bit fraction.
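
The arithmetic behind such a multiplier can be modeled in software. The Python sketch below is an illustration under stated assumptions, not a transcription of the VHDL: it assumes two's complement 16.16 operands whose sign is bit 31, sign-extends each to 48 bits, and forms the low 48 bits of the product from six unsigned 16 x 16 partial products. The names m00 through m20 are chosen to echo the intermediate signals declared in the listing, but which partial product each signal carries in the hardware is our assumption.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def to_q16_16(x):
    """Encode a real number as a 32-bit two's complement 16.16 word."""
    return int(round(x * 65536)) & MASK32

def from_q16_16(word):
    """Decode a 32-bit two's complement 16.16 word."""
    signed = word - (1 << 32) if word & 0x80000000 else word
    return signed / 65536

def mul_q16_16(a, b):
    """32-bit 16.16 product built from unsigned 16 x 16 multiplies."""
    al, ah = a & MASK16, (a >> 16) & MASK16
    bl, bh = b & MASK16, (b >> 16) & MASK16
    ae = MASK16 if a & 0x80000000 else 0   # sign extension word of a
    be = MASK16 if b & 0x80000000 else 0   # sign extension word of b
    m00 = al * bl                          # weight 2**0
    m01, m10 = al * bh, ah * bl            # weight 2**16
    m02, m11, m20 = al * be, ah * bh, ae * bl   # weight 2**32
    # Partial products of weight 2**48 and above cannot reach bits 47..0.
    low48 = (m00 + ((m01 + m10) << 16) + ((m02 + m11 + m20) << 32)) & ((1 << 48) - 1)
    return (low48 >> 16) & MASK32          # keep bits 47..16

p = from_q16_16(mul_q16_16(to_q16_16(1.5), to_q16_16(-2.25)))
q = from_q16_16(mul_q16_16(to_q16_16(-1.0), to_q16_16(-1.0)))
```

Keeping bits 47 to 16 of the product is what places the binary point back in the middle of the 32-bit result; the low 16 fraction bits are truncated and overflow beyond 16 integer bits is not detected in this model.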


























1. VHDL

-- School: NPS - Naval Postgraduate School, Monterey
-- Student: Njuguna Macaria
-- Create Date: 14:10:56 07/07/07
-- Design Name:
-- Module Name: mult_32to32 - Behavioral
-- Project Name:
-- Target Device: xc2v6000-4ff1517 (virtex II in SRC-6)
-- Tool versions: Xilinx 6.3031 and Synplicity 8.1
-- Simulation: Modelsim and Synplicity's simulation tool
-- Description:
-- Dependencies: Modified from
-- Revision:
-- Revision 0.01 File Created
-- Additional Comments:

-- ******************** COMPONENTS NEEDED ********************

-- UNSIGNED 16 BIT MULTIPLIER --

library ieee;
use ieee.std_logic_1164.all;
Library UNISIM;
use UNISIM.vcomponents.all;

-- Entity: Description of pins (PORTS)
entity mult16_32 is
  port ( au, bu : in  std_logic_vector(15 downto 0);
         clk    : in  std_logic;
         produ  : out std_logic_vector(31 downto 0));
end mult16_32;

architecture mult16_32_beh of mult16_32 is

component FDR
  port (
    Q : out STD_ULOGIC;
    D : in  STD_ULOGIC;
    C : in  STD_ULOGIC;
    R : in  STD_ULOGIC);
end component;

component MULT18X18S
  port (A  : in  STD_LOGIC_VECTOR (17 downto 0);
        B  : in  STD_LOGIC_VECTOR (17 downto 0);
        C  : in  STD_ULOGIC ;
        CE : in  STD_ULOGIC ;
        P  : out STD_LOGIC_VECTOR (35 downto 0);
        R  : in  STD_ULOGIC );
end component;












































































































































signal a_wire, b_wire: std_logic_vector(15 downto 0);
signal p_wire: std_logic_vector(31 downto 0);
signal discard: std_logic_vector( 3 downto 0);

attribute RLOC : string;
attribute RLOC of REG_A0 : label is "X0Y0" ;
attribute RLOC of REG_A1 : label is "X0Y0" ;
attribute RLOC of REG_A2 : label is "X0Y1" ;
attribute RLOC of REG_A3 : label is "X0Y1" ;
attribute RLOC of REG_A4 : label is "X0Y2" ;
attribute RLOC of REG_A5 : label is "X0Y2" ;
attribute RLOC of REG_A6 : label is "X0Y3" ;
attribute RLOC of REG_A7 : label is "X0Y3" ;
attribute RLOC of REG_A8 : label is "X0Y4" ;
attribute RLOC of REG_A9 : label is "X0Y4" ;
attribute RLOC of REG_A10: label is "X0Y5" ;
attribute RLOC of REG_A11: label is "X0Y5" ;
attribute RLOC of REG_A12: label is "X0Y6" ;
attribute RLOC of REG_A13: label is "X0Y6" ;
attribute RLOC of REG_A14: label is "X0Y7" ;
attribute RLOC of REG_A15: label is "X0Y7" ;
-- attribute RLOC of REG_A16: label is "X-1Y7";
-- attribute RLOC of REG_A17: label is "X-1Y7";

attribute RLOC of REG_B0 : label is "X2Y0" ;
attribute RLOC of REG_B1 : label is "X2Y0" ;
attribute RLOC of REG_B2 : label is "X2Y1" ;
attribute RLOC of REG_B3 : label is "X2Y1" ;
attribute RLOC of REG_B4 : label is "X2Y2" ;
attribute RLOC of REG_B5 : label is "X2Y2" ;
attribute RLOC of REG_B6 : label is "X2Y3" ;
attribute RLOC of REG_B7 : label is "X2Y3" ;
attribute RLOC of REG_B8 : label is "X2Y4" ;
attribute RLOC of REG_B9 : label is "X2Y4" ;
attribute RLOC of REG_B10: label is "X2Y5" ;
attribute RLOC of REG_B11: label is "X2Y5" ;
attribute RLOC of REG_B12: label is "X2Y6" ;
attribute RLOC of REG_B13: label is "X2Y6" ;
attribute RLOC of REG_B14: label is "X2Y7" ;
attribute RLOC of REG_B15: label is "X2Y7" ;
-- attribute RLOC of REG_B16: label is "X-1Y6";
-- attribute RLOC of REG_B17: label is "X-1Y6";

attribute RLOC of REG_P0 : label is "X-2Y0";
attribute RLOC of REG_P1 : label is "X1Y0" ;
attribute RLOC of REG_P2 : label is "X1Y0" ;
attribute RLOC of REG_P3 : label is "X1Y1" ;
attribute RLOC of REG_P4 : label is "X1Y1" ;
attribute RLOC of REG_P5 : label is "X3Y0" ;
attribute RLOC of REG_P6 : label is "X3Y0" ;
attribute RLOC of REG_P7 : label is "X3Y1" ;
attribute RLOC of REG_P8 : label is "X-2Y2";
attribute RLOC of REG_P9 : label is "X1Y2" ;
attribute RLOC of REG_P10: label is "X1Y2" ;
attribute RLOC of REG_P11: label is "X1Y3" ;
attribute RLOC of REG_P12: label is "X1Y3" ;
attribute RLOC of REG_P13: label is "X3Y2" ;
attribute RLOC of REG_P14: label is "X3Y2" ;
attribute RLOC of REG_P15: label is "X3Y3" ;
attribute RLOC of REG_P16: label is "X-2Y4";
attribute RLOC of REG_P17: label is "X1Y4" ;
attribute RLOC of REG_P18: label is "X1Y4" ;
attribute RLOC of REG_P19: label is "X1Y5" ;
attribute RLOC of REG_P20: label is "X1Y5" ;
attribute RLOC of REG_P21: label is "X3Y4" ;
attribute RLOC of REG_P22: label is "X3Y4" ;
attribute RLOC of REG_P23: label is "X3Y5" ;
attribute RLOC of REG_P24: label is "X-2Y6";
attribute RLOC of REG_P25: label is "X1Y6" ;
attribute RLOC of REG_P26: label is "X1Y6" ;
attribute RLOC of REG_P27: label is "X1Y7" ;
attribute RLOC of REG_P28: label is "X1Y7" ;
attribute RLOC of REG_P29: label is "X3Y6" ;
attribute RLOC of REG_P30: label is "X3Y6" ;
attribute RLOC of REG_P31: label is "X3Y7" ;
-- attribute RLOC of REG_P32: label is "X3Y1" ;
-- attribute RLOC of REG_P33: label is "X3Y3" ;
-- attribute RLOC of REG_P34: label is "X3Y5" ;
-- attribute RLOC of REG_P35: label is "X3Y7" ;

attribute BEL : string;
attribute BEL of REG_A0 : label is "FFX" ;
attribute BEL of REG_A1 : label is "FFY" ;
attribute BEL of REG_A2 : label is "FFX" ;
attribute BEL of REG_A3 : label is "FFY" ;
attribute BEL of REG_A4 : label is "FFX" ;
attribute BEL of REG_A5 : label is "FFY" ;
attribute BEL of REG_A6 : label is "FFX" ;
attribute BEL of REG_A7 : label is "FFY" ;
attribute BEL of REG_A8 : label is "FFX" ;
attribute BEL of REG_A9 : label is "FFY" ;
attribute BEL of REG_A10: label is "FFX" ;
attribute BEL of REG_A11: label is "FFY" ;
attribute BEL of REG_A12: label is "FFX" ;
attribute BEL of REG_A13: label is "FFY" ;
attribute BEL of REG_A14: label is "FFX" ;
attribute BEL of REG_A15: label is "FFY" ;
-- attribute BEL of REG_A16: label is "FFX" ;
-- attribute BEL of REG_A17: label is "FFY" ;
attribute BEL of REG_B0 : label is "FFX" ;
attribute BEL of REG_B1 : label is "FFY" ;
attribute BEL of REG_B2 : label is "FFX" ;
attribute BEL of REG_B3 : label is "FFY" ;
attribute BEL of REG_B4 : label is "FFX" ;
attribute BEL of REG_B5 : label is "FFY" ;
attribute BEL of REG_B6 : label is "FFX" ;
attribute BEL of REG_B7 : label is "FFY" ;
attribute BEL of REG_B8 : label is "FFX" ;
attribute BEL of REG_B9 : label is "FFY" ;
attribute BEL of REG_B10: label is "FFX" ;
attribute BEL of REG_B11: label is "FFY" ;
attribute BEL of REG_B12: label is "FFX" ;
attribute BEL of REG_B13: label is "FFY" ;
attribute BEL of REG_B14: label is "FFX" ;
attribute BEL of REG_B15: label is "FFY" ;
-- attribute BEL of REG_B16: label is "FFX" ;
-- attribute BEL of REG_B17: label is "FFY" ;
attribute BEL of REG_P0 : label is "FFY" ;
attribute BEL of REG_P1 : label is "FFX" ;
attribute BEL of REG_P2 : label is "FFY" ;
attribute BEL of REG_P3 : label is "FFX" ;
attribute BEL of REG_P4 : label is "FFY" ;
attribute BEL of REG_P5 : label is "FFX" ;
attribute BEL of REG_P6 : label is "FFY" ;
attribute BEL of REG_P7 : label is "FFX" ;
attribute BEL of REG_P8 : label is "FFY" ;
attribute BEL of REG_P9 : label is "FFX" ;
attribute BEL of REG_P10: label is "FFY" ;
attribute BEL of REG_P11: label is "FFX" ;
attribute BEL of REG_P12: label is "FFY" ;
attribute BEL of REG_P13: label is "FFX" ;
attribute BEL of REG_P14: label is "FFY" ;
attribute BEL of REG_P15: label is "FFX" ;
attribute BEL of REG_P16: label is "FFY" ;
attribute BEL of REG_P17: label is "FFX" ;
attribute BEL of REG_P18: label is "FFY" ;
attribute BEL of REG_P19: label is "FFX" ;
attribute BEL of REG_P20: label is "FFY" ;
attribute BEL of REG_P21: label is "FFX" ;
attribute BEL of REG_P22: label is "FFY" ;
attribute BEL of REG_P23: label is "FFX" ;
attribute BEL of REG_P24: label is "FFY" ;
attribute BEL of REG_P25: label is "FFX" ;
attribute BEL of REG_P26: label is "FFY" ;
attribute BEL of REG_P27: label is "FFX" ;
attribute BEL of REG_P28: label is "FFY" ;
attribute BEL of REG_P29: label is "FFX" ;
attribute BEL of REG_P30: label is "FFY" ;
attribute BEL of REG_P31: label is "FFX" ;
-- attribute BEL of REG_P32: label is "FFY" ;
-- attribute BEL of REG_P33: label is "FFY" ;
-- attribute BEL of REG_P34: label is "FFY" ;
-- attribute BEL of REG_P35: label is "FFY" ;
begin
REG_A0  : FDR port map(Q => a_wire(0),  C => CLK, D => au(0),  R => '0');
REG_A1  : FDR port map(Q => a_wire(1),  C => CLK, D => au(1),  R => '0');
REG_A2  : FDR port map(Q => a_wire(2),  C => CLK, D => au(2),  R => '0');
REG_A3  : FDR port map(Q => a_wire(3),  C => CLK, D => au(3),  R => '0');
REG_A4  : FDR port map(Q => a_wire(4),  C => CLK, D => au(4),  R => '0');
REG_A5  : FDR port map(Q => a_wire(5),  C => CLK, D => au(5),  R => '0');
REG_A6  : FDR port map(Q => a_wire(6),  C => CLK, D => au(6),  R => '0');
REG_A7  : FDR port map(Q => a_wire(7),  C => CLK, D => au(7),  R => '0');
REG_A8  : FDR port map(Q => a_wire(8),  C => CLK, D => au(8),  R => '0');
REG_A9  : FDR port map(Q => a_wire(9),  C => CLK, D => au(9),  R => '0');
REG_A10 : FDR port map(Q => a_wire(10), C => CLK, D => au(10), R => '0');
REG_A11 : FDR port map(Q => a_wire(11), C => CLK, D => au(11), R => '0');
REG_A12 : FDR port map(Q => a_wire(12), C => CLK, D => au(12), R => '0');
REG_A13 : FDR port map(Q => a_wire(13), C => CLK, D => au(13), R => '0');
REG_A14 : FDR port map(Q => a_wire(14), C => CLK, D => au(14), R => '0');
REG_A15 : FDR port map(Q => a_wire(15), C => CLK, D => au(15), R => '0');
-- REG_A16 : FDR port map(Q => a_wire(16), C => CLK, D => '0', R => '0');
-- REG_A17 : FDR port map(Q => a_wire(17), C => CLK, D => '0', R => '0');

REG_B0  : FDR port map(Q => b_wire(0),  C => CLK, D => bu(0),  R => '0');
REG_B1  : FDR port map(Q => b_wire(1),  C => CLK, D => bu(1),  R => '0');
REG_B2  : FDR port map(Q => b_wire(2),  C => CLK, D => bu(2),  R => '0');
REG_B3  : FDR port map(Q => b_wire(3),  C => CLK, D => bu(3),  R => '0');
REG_B4  : FDR port map(Q => b_wire(4),  C => CLK, D => bu(4),  R => '0');
REG_B5  : FDR port map(Q => b_wire(5),  C => CLK, D => bu(5),  R => '0');
REG_B6  : FDR port map(Q => b_wire(6),  C => CLK, D => bu(6),  R => '0');
REG_B7  : FDR port map(Q => b_wire(7),  C => CLK, D => bu(7),  R => '0');
REG_B8  : FDR port map(Q => b_wire(8),  C => CLK, D => bu(8),  R => '0');
REG_B9  : FDR port map(Q => b_wire(9),  C => CLK, D => bu(9),  R => '0');
REG_B10 : FDR port map(Q => b_wire(10), C => CLK, D => bu(10), R => '0');
REG_B11 : FDR port map(Q => b_wire(11), C => CLK, D => bu(11), R => '0');
REG_B12 : FDR port map(Q => b_wire(12), C => CLK, D => bu(12), R => '0');
REG_B13 : FDR port map(Q => b_wire(13), C => CLK, D => bu(13), R => '0');
REG_B14 : FDR port map(Q => b_wire(14), C => CLK, D => bu(14), R => '0');
REG_B15 : FDR port map(Q => b_wire(15), C => CLK, D => bu(15), R => '0');
-- REG_B16 : FDR port map(Q => b_wire(16), C => CLK, D => '0', R => '0');
-- REG_B17 : FDR port map(Q => b_wire(17), C => CLK, D => '0', R => '0');

Mult1 : MULT18X18S
  port map(P(31 downto 0) => p_wire, P(35 downto 32) => discard(3 downto 0),
           A(17 downto 16) => "00", A(15 downto 0) => a_wire,
           B(17 downto 16) => "00", B(15 downto 0) => b_wire,
           C  => CLK,
           CE => '1',
           R  => '0');

REG_P0  : FDR port map(Q => produ(0),  C => CLK, D => p_wire(0),  R => '0');
REG_P1  : FDR port map(Q => produ(1),  C => CLK, D => p_wire(1),  R => '0');
REG_P2  : FDR port map(Q => produ(2),  C => CLK, D => p_wire(2),  R => '0');
REG_P3  : FDR port map(Q => produ(3),  C => CLK, D => p_wire(3),  R => '0');
REG_P4  : FDR port map(Q => produ(4),  C => CLK, D => p_wire(4),  R => '0');
REG_P5  : FDR port map(Q => produ(5),  C => CLK, D => p_wire(5),  R => '0');
REG_P6  : FDR port map(Q => produ(6),  C => CLK, D => p_wire(6),  R => '0');
REG_P7  : FDR port map(Q => produ(7),  C => CLK, D => p_wire(7),  R => '0');
REG_P8  : FDR port map(Q => produ(8),  C => CLK, D => p_wire(8),  R => '0');
REG_P9  : FDR port map(Q => produ(9),  C => CLK, D => p_wire(9),  R => '0');
REG_P10 : FDR port map(Q => produ(10), C => CLK, D => p_wire(10), R => '0');
REG_P11 : FDR port map(Q => produ(11), C => CLK, D => p_wire(11), R => '0');
REG_P12 : FDR port map(Q => produ(12), C => CLK, D => p_wire(12), R => '0');
REG_P13 : FDR port map(Q => produ(13), C => CLK, D => p_wire(13), R => '0');
REG_P14 : FDR port map(Q => produ(14), C => CLK, D => p_wire(14), R => '0');
REG_P15 : FDR port map(Q => produ(15), C => CLK, D => p_wire(15), R => '0');
REG_P16 : FDR port map(Q => produ(16), C => CLK, D => p_wire(16), R => '0');
REG_P17 : FDR port map(Q => produ(17), C => CLK, D => p_wire(17), R => '0');
REG_P18 : FDR port map(Q => produ(18), C => CLK, D => p_wire(18), R => '0');
REG_P19 : FDR port map(Q => produ(19), C => CLK, D => p_wire(19), R => '0');
REG_P20 : FDR port map(Q => produ(20), C => CLK, D => p_wire(20), R => '0');
REG_P21 : FDR port map(Q => produ(21), C => CLK, D => p_wire(21), R => '0');
REG_P22 : FDR port map(Q => produ(22), C => CLK, D => p_wire(22), R => '0');
REG_P23 : FDR port map(Q => produ(23), C => CLK, D => p_wire(23), R => '0');
REG_P24 : FDR port map(Q => produ(24), C => CLK, D => p_wire(24), R => '0');
REG_P25 : FDR port map(Q => produ(25), C => CLK, D => p_wire(25), R => '0');
REG_P26 : FDR port map(Q => produ(26), C => CLK, D => p_wire(26), R => '0');
REG_P27 : FDR port map(Q => produ(27), C => CLK, D => p_wire(27), R => '0');
REG_P28 : FDR port map(Q => produ(28), C => CLK, D => p_wire(28), R => '0');
REG_P29 : FDR port map(Q => produ(29), C => CLK, D => p_wire(29), R => '0');
REG_P30 : FDR port map(Q => produ(30), C => CLK, D => p_wire(30), R => '0');
REG_P31 : FDR port map(Q => produ(31), C => CLK, D => p_wire(31), R => '0');
-- REG_P32 : FDR port map(Q => discard(3), C => CLK, D => p_wire(32), R => '0');
-- REG_P33 : FDR port map(Q => discard(2), C => CLK, D => p_wire(33), R => '0');
-- REG_P34 : FDR port map(Q => discard(1), C => CLK, D => p_wire(34), R => '0');
-- REG_P35 : FDR port map(Q => discard(0), C => CLK, D => p_wire(35), R => '0');
end mult16_32_beh;

-- 32 BIT MULTIPLIER --

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity mult_32to32 is
PORT (
  a, b : in  std_logic_vector(31 downto 0);
  clk  : in  std_logic;
  prod : out std_logic_vector(31 downto 0)
);
END mult_32to32;

architecture structural of mult_32to32 is

-- Declare component: Unsigned 16 bit Multiplier
component mult16_32
  port( au, bu : in  std_logic_vector(15 downto 0);
        clk    : in  std_logic;
        produ  : out std_logic_vector(31 downto 0));
end component;

-- Intermediate signals for multiplier stage
SIGNAL M00 : std_logic_vector(31 downto 0);
SIGNAL M01 : std_logic_vector(31 downto 0);
SIGNAL M10 : std_logic_vector(31 downto 0);
SIGNAL M02 : std_logic_vector(31 downto 0);
SIGNAL M11 : std_logic_vector(31 downto 0);
SIGNAL M20 : std_logic_vector(31 downto 0);

-- Intermediate signals for Adding stage
SIGNAL A00 : std_logic_vector(33 downto 0);
SIGNAL A01 : std_logic_vector(49 downto 0);
SIGNAL A10 : std_logic_vector(49 downto 0);
SIGNAL A02 : std_logic_vector(65 downto 0);
SIGNAL A11 : std_logic_vector(65 downto 0);
SIGNAL A20 : std_logic_vector(65 downto 0);

-- Some definitions for implementing sign extend
SIGNAL ae : std_logic_vector(15 downto 0);
SIGNAL be : std_logic_vector(15 downto 0);

-- Signal to hold value (synplify pro will not work
-- if the width is not matched, Xilinx will)
SIGNAL prdt1 : std_logic_vector(49 downto 0);
SIGNAL prdt2 : std_logic_vector(65 downto 0);
SIGNAL prdt3 : std_logic_vector(65 downto 0);
SIGNAL prdt4 : std_logic_vector(65 downto 0);
SIGNAL prdt5 : std_logic_vector(65 downto 0);
SIGNAL prdt6 : std_logic_vector(65 downto 0);
--SIGNAL b : std_logic_vector(31 downto 0);
-—- BEGIN the 32 bit Multiplier 
BEGIN
PROCESS (clk)
    VARIABLE zeros : std_logic_vector(15 downto 0) := X"0000"; -- zeros
    VARIABLE ones  : std_logic_vector(15 downto 0) := X"FFFF"; -- ones
BEGIN
    IF clk'event and clk = '1' THEN
        IF (a(15) = '1') THEN
            ae(15 downto 0) <= ones;
        ELSE
            ae(15 downto 0) <= zeros;
        END IF;
        IF (b(15) = '1') THEN
            be(15 downto 0) <= ones;
        ELSE
            be(15 downto 0) <= zeros;
        END IF;
    END IF;
END PROCESS;

-- Apply the Multiplies
U00 : mult16_32
    PORT MAP (au(15 downto 0)    => a(15 downto 0),
              bu(15 downto 0)    => b(15 downto 0),
              clk                => clk,
              produ(31 downto 0) => M00(31 downto 0));

U01 : mult16_32
    PORT MAP (au(15 downto 0)    => a(15 downto 0),
              bu(15 downto 0)    => b(31 downto 16),
              clk                => clk,
              produ(31 downto 0) => M01(31 downto 0));

U10 : mult16_32
    PORT MAP (au(15 downto 0)    => a(31 downto 16),
              bu(15 downto 0)    => b(15 downto 0),
              clk                => clk,
              produ(31 downto 0) => M10(31 downto 0));

U02 : mult16_32
    PORT MAP (au(15 downto 0)    => a(15 downto 0),
              bu(15 downto 0)    => be(15 downto 0),
              clk                => clk,
              produ(31 downto 0) => M02(31 downto 0));

U11 : mult16_32
    PORT MAP (au(15 downto 0)    => a(31 downto 16),
              bu(15 downto 0)    => b(31 downto 16),
              clk                => clk,
              produ(31 downto 0) => M11(31 downto 0));

U20 : mult16_32
    PORT MAP (au(15 downto 0)    => ae(15 downto 0),
              bu(15 downto 0)    => b(15 downto 0),
              clk                => clk,
              produ(31 downto 0) => M20(31 downto 0));

-- shift the values appropriately for addition
PROCESS (clk)
BEGIN
    IF clk'event and clk = '1' then
        A00(33 downto 32) <= "00";
        A00(31 downto 0)  <= M00(31 downto 0);

        A01(49 downto 48) <= "00";
        A01(47 downto 16) <= M01(31 downto 0);
        A01(15 downto 0)  <= x"0000";

        A10(49 downto 48) <= "00";
        A10(47 downto 16) <= M10(31 downto 0);
        A10(15 downto 0)  <= x"0000";

        A02(65 downto 64) <= "00";
        A02(63 downto 32) <= M02(31 downto 0);
        A02(31 downto 0)  <= X"00000000";

        A11(65 downto 64) <= "00";
        A11(63 downto 32) <= M11(31 downto 0);
        A11(31 downto 0)  <= X"00000000";

        A20(65 downto 64) <= "00";
        A20(63 downto 32) <= M20(31 downto 0);
        A20(31 downto 0)  <= X"00000000";
    END if;
END PROCESS;

PROCESS (clk)
BEGIN
    IF clk'event and clk = '1' then
        prdt1 <= unsigned(A00) + unsigned(A01) + unsigned(A10);
        prdt2 <= unsigned(A02) + unsigned(A11) + unsigned(A20);
        prdt3 <= unsigned(prdt2) + unsigned(prdt1);
        prod <= prdt3(47 downto 16);
    END IF;
END PROCESS;

END structural;
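
The partial-product scheme of the 32 bit multiplier can be illustrated in software. The following is a minimal C sketch under stated assumptions, not the circuit itself: the function name mul_16_16 is hypothetical, both operands are treated as unsigned 16.16 fixed-point values, and the sign-extension products (M02 and M20 in the VHDL) are left out. It forms the four 16 x 16 products, aligns them the way the adding stage does, and keeps bits 47 down to 16 of the sum.

```c
#include <stdint.h>

/* Unsigned 16.16 fixed-point multiply built from 16x16 partial products.
 * Hypothetical helper for illustration; sign-extension terms are omitted. */
static uint32_t mul_16_16(uint32_t a, uint32_t b)
{
    uint64_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint64_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint64_t m00 = a_lo * b_lo;   /* weight 2^0  */
    uint64_t m01 = a_lo * b_hi;   /* weight 2^16 */
    uint64_t m10 = a_hi * b_lo;   /* weight 2^16 */
    uint64_t m11 = a_hi * b_hi;   /* weight 2^32 */

    /* Align and add the partial products (full 64-bit product). */
    uint64_t full = m00 + ((m01 + m10) << 16) + (m11 << 32);

    /* Keep bits 47..16, i.e. 16 integer bits and 16 fraction bits. */
    return (uint32_t)(full >> 16);
}
```

For instance, mul_16_16(0x00018000, 0x00028000) multiplies 1.5 by 2.5 and returns 0x0003C000, which is 3.75 in 16.16 format.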





2. Verilog








// $Id: S_MULT_64TO64_SRC6.v,v 1.1 2007/06/25 18:20:29 pvg Exp $

// Copyright 2007 SRC Computers, Inc. All Rights Reserved.
// Manufactured in the United States of America.
//
// SRC Computers, Inc.
// 4240 N Nevada Avenue
// Colorado Springs, CO 80907
// (v) (719) 262-0213
// (f) (719) 262-0223
//
// No permission has been granted to distribute this software
// without the express permission of SRC Computers, Inc.
//
// This program is distributed WITHOUT ANY WARRANTY OF ANY KIND.

// DESCRIPTION: This module performs 64 bit signed integer multiplication
// and provides a 64 bit result.
// This module instantiates Xilinx components.
//

// This file was modified by Njuguna Macaria to make a 64 bit by 64 bit
// Multiplier with a 64 bit result that is shifted to the appropriate
// decimal point for a 32 bit integer and 32 bit fraction.

// //
// 32 BIT MULTIPLIER //
// //


`timescale 1ns/1ns

module mult32_64s (A, B, Q, CLK, CLR);

input  [31:0] A;
input  [31:0] B;
output [63:0] Q;
input  CLK /* synthesis syn_noclockbuf=1 syn_maxfan=100000 */ ;
input  CLR;

reg  [63:0] Q;

reg  [31:0] AR;
reg  [31:0] BR;

wire [35:0] R0;
wire [35:0] R1;
wire [35:0] R2;
wire [35:0] R3;

reg  [31:0] R0_R;
reg  [31:0] R1_R;
reg  [31:0] R2_R;
reg  [31:0] R3_R;





always @ (posedge CLK or posedge CLR) 


begin 
if (CLR) begin 
AR <= 0; 
BR <= 0; 
end 
else begin 
AR <= A; 
BR <= B; 
end 
end 


MULT18X18S X0 (
    .A  ({2'b0, AR[15:0]}),
    .B  ({2'b0, BR[15:0]}),
    .C  (CLK),
    .R  (CLR),
    .CE (1'b1),
    .P  (R0)
);

MULT18X18S X1 (
    .A  ({2'b0, AR[31:16]}),
    .B  ({2'b0, BR[15:0]}),
    .C  (CLK),
    .R  (CLR),
    .CE (1'b1),
    .P  (R1)
);

MULT18X18S X2 (
    .A  ({2'b0, AR[15:0]}),
    .B  ({2'b0, BR[31:16]}),
    .C  (CLK),
    .R  (CLR),
    .CE (1'b1),
    .P  (R2)
);

MULT18X18S X3 (
    .A  ({2'b0, AR[31:16]}),
    .B  ({2'b0, BR[31:16]}),
    .C  (CLK),
    .R  (CLR),
    .CE (1'b1),
    .P  (R3)
);


always @ (posedge CLK or posedge CLR)
begin
    if (CLR) begin
        R0_R <= 0;
        R1_R <= 0;
        R2_R <= 0;
        R3_R <= 0;
    end
    else begin
        R0_R <= R0;
        R1_R <= R1;
        R2_R <= R2;
        R3_R <= R3;
    end
end

always @ (posedge CLK or posedge CLR)
begin
    if (CLR) begin
        Q <= 0;
    end
    else begin
        // add and shift
        Q <= R0_R + {R1_R,16'b0} + {R2_R,16'b0} + {R3_R,32'b0};
    end
end


endmodule 








64 BIT MULTIPLIER 











B, Q, CLK, CLR); 


input  [63:0] A;
input  [63:0] B;
output [63:0] Q;
input  CLK /* synthesis syn_noclockbuf=1 syn_maxfan=100000 */ ;
input  CLR;

reg  [127:0] Q_R;
reg  [63:0] Q;
reg  [63:0] AR;
reg  [63:0] BR;
wire [63:0] R0;
wire [63:0] R1;
wire [63:0] R2;
wire [63:0] R3;
reg  [63:0] R0_R;
reg  [63:0] R1_R;
reg  [63:0] R2_R;
reg  [63:0] R3_R;








always @ (posedge CLK or posedge CLR) 
begin 
if (CLR) begin 


AR <= 0; 
BR <= 0; 
end 
else begin 
AR <= A; 
BR <= B; 
end 


end 


mult32_64s X0 (
    .A   (AR[31:0]),
    .B   (BR[31:0]),
    .Q   (R0),
    .CLK (CLK),
    .CLR (CLR)
);

mult32_64s X1 (
    .A   (AR[63:32]),
    .B   (BR[31:0]),
    .Q   (R1),
    .CLK (CLK),
    .CLR (CLR)
);

mult32_64s X2 (
    .A   (AR[31:0]),
    .B   (BR[63:32]),
    .Q   (R2),
    .CLK (CLK),
    .CLR (CLR)
);

mult32_64s X3 (
    .A   (AR[63:32]),
    .B   (BR[63:32]),
    .Q   (R3),
    .CLK (CLK),
    .CLR (CLR)
);


always @ (posedge CLK or posedge CLR)
begin
    if (CLR) begin
        R0_R <= 0;
        R1_R <= 0;
        R2_R <= 0;
        R3_R <= 0;
    end
    else begin
        R0_R <= R0;
        R1_R <= R1;
        R2_R <= R2;
        R3_R <= R3;
    end
end

always @ (posedge CLK or posedge CLR)
begin
    if (CLR) begin
        Q <= 0;
    end
    else begin
        // add and shift
        Q_R <= R0_R + {R1_R,32'b0} + {R2_R,32'b0} + {R3_R,64'b0};
        // Only take 64 bits from the middle for a 32.32 number
        Q <= Q_R[95:32];
    end
end


endmodule 
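
The add-and-shift step of the 64 bit multiplier, followed by the selection of bits 95 down to 32, can be mirrored in C without a 128-bit type. This is a minimal sketch under stated assumptions: mul_32_32 is a hypothetical name, the operands are unsigned 32.32 fixed-point values, and the one-cycle pipeline delay between Q_R and Q is ignored. Because only the low 32 bits of R0 fall below bit 32, shifting the 128-bit sum right by 32 equals (R0 >> 32) + R1 + R2 + (R3 << 32), taken modulo 2^64.

```c
#include <stdint.h>

/* Unsigned 32.32 fixed-point multiply from four 32x32 partial products.
 * Hypothetical helper; returns bits 95..32 of the 128-bit product. */
static uint64_t mul_32_32(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    uint64_t r0 = a_lo * b_lo;   /* weight 2^0  */
    uint64_t r1 = a_hi * b_lo;   /* weight 2^32 */
    uint64_t r2 = a_lo * b_hi;   /* weight 2^32 */
    uint64_t r3 = a_hi * b_hi;   /* weight 2^64 */

    /* Bits 95..32 of r0 + (r1 << 32) + (r2 << 32) + (r3 << 64);
     * unsigned wrap-around discards everything above bit 95. */
    return (r0 >> 32) + r1 + r2 + (r3 << 32);
}
```

As a check, mul_32_32(0x180000000, 0x200000000) multiplies 1.5 by 2.0 and returns 0x300000000, which is 3.0 in 32.32 format.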
APPENDIX C. SRC C CODE

C.1. UNIFORM SEGMENTATION
1. Floating Point

a. Main.c








#include<stdio.h> 
#include<stdlib.h> 
#include<strings.h> 
#include<libmap.h> 





// Subroutine initialization in Main
void subr_map( double acoef[],
               int ncoef,
               double incre,
               double offsetV,
               double x[],
               double y[],
               double ys[],
               int npts,
               int64_t *time0,
               int64_t *time1,
               int mapnum);

// MAIN
main () {

// Initialize Variables
FILE *fp1;
double *array, *x, *y, *ys, incre, val, offsetV;
int i, ir, nc, npts, mapnum, nmap, ncoef, arr_indx, inNum;
int64_t tm0, tm1;

// Start NFG and select map number
printf ("\n***START UP THE NFG ***\n");
mapnum = 0;
nmap = 1;

// ! allocate map to this problem
map_allocate (nmap);

// User interface
printf (" \n");
printf ("Function 1.  2^x : 1\n");
printf ("Function 2.  1/x : 2\n");
printf ("Function 3.  sqrt(x) : 3\n");
printf ("Function 4.  1/sqrt(x) : 4\n");
printf ("Function 5.  log2(x) : 5\n");
printf ("Function 6.  ln(x) : 6\n");
printf ("Function 7.  sin(pi*x) : 7\n");
printf ("Function 8.  cos(pi*x) : 8\n");
printf ("Function 9.  tan(pi*x) : 9\n");
printf ("Function 10. sqrt(-ln(x)) : 10\n");
printf ("Function 11. tan(pi*x)^2 + 1 : 11\n");
printf ("Function 12. -(x*log2(x) + (1-x)*log2(1-x)) : 12\n");
printf ("Function 13. 1/(1+e^(-x)) : 13\n");
printf ("Function 14. (1/sqrt(2*pi))*exp(-x^2/2) : 14\n");
printf ("Function 15. sin(exp(x)) : 15\n");
printf (" \n");

printf ("\nSelect which function to implement: ");
scanf ("%i", &inNum);
printf ("What value did I enter: %i \n ", inNum);

// Open the Hex data file to read
switch (inNum)
{
case 1:  fp1 = fopen("Data/memD1.mem","r");
         break;
case 2:  fp1 = fopen("Data/memD2.mem","r");
         break;
case 3:  fp1 = fopen("Data/memD3.mem","r");
         break;
case 4:  fp1 = fopen("Data/memD4.mem","r");
         break;
case 5:  fp1 = fopen("Data/memD5.mem","r");
         break;
case 6:  fp1 = fopen("Data/memD6.mem","r");
         break;
case 7:  fp1 = fopen("Data/memD7.mem","r");
         break;
case 8:  fp1 = fopen("Data/memD8.mem","r");
         break;
case 9:  fp1 = fopen("Data/memD9.mem","r");
         break;
case 10: fp1 = fopen("Data/memD10.mem","r");
         break;
case 11: fp1 = fopen("Data/memD11.mem","r");
         break;
case 12: fp1 = fopen("Data/memD12.mem","r");
         break;
case 13: fp1 = fopen("Data/memD13.mem","r");
         break;
case 14: fp1 = fopen("Data/memD14.mem","r");
         break;
default: fp1 = fopen("Data/memD15.mem","r");
         break;
}

printf ("fp1 %i\n", fp1);

// Read in the values from the file
fscanf (fp1, "%i", &ncoef);
fscanf (fp1, "%lf", &incre);
fscanf (fp1, "%lf", &offsetV);

// Depending on number segments
//nc = 50;   // For 16 bit accuracy
//nc = 600;  // For 23 bits
//nc = 1500; // For 32 bits
nc = 35000;  // For 40 bits

array double*) Cache_Aligned_Allocate 
x double*)Cache_Aligned_Allocate 
y double*)Cache_Aligned_Allocate 
ys = (double*) Cache_Aligned_Allocate 























( 
( 
( 
( 





// check if the right thing was read
printf (" ncoef %i\n",ncoef);

// read_file
for (i=0;i<ncoef;i++) {
    fscanf (fp1, "%lf", &val);
    array[i*4] = val;

    fscanf (fp1, "%lf", &val);
    array[i*4+1] = val;

    fscanf (fp1, "%lf", &val);
    array[i*4+2] = val;

    fscanf (fp1, "%lf", &val);
    array[i*4+3] = val;
} // end read_file

fclose(fp1);

npts = 30;
// create_samples
for (ir=0;ir<npts;ir++) {
    arr_indx = ir % ncoef;
    x[ir] = array[arr_indx*4];
    printf ("ir %3i x_values are: %lf\n",ir,x[ir]);
} //end create_samples

printf ("main ncoef %i npts %i\n",ncoef,npts);

subr_map (array, ncoef, incre, offsetV, x, y, ys, npts, &tm0, &tm1,
          mapnum);

printf ("\n********** BACK FROM MAP **********\n");
printf ("%lld clocks for NFG\n", tm0);
printf ("%lld clocks for SRC Macro\n", tm1);

for (i=0;i<npts;i++) {
    printf ("x: %5.16lf ysubr: %5.16lf ySRCMacro: %5.16lf\n",
            x[i],y[i],ys[i] );
}

// ! release the map resources
map_free (nmap);
}
b. subr.mc

#include <libmap.h>

void subr_map ( double ac[],
                int ncoef,
                double incre,
                double offsetV,
                double xc[],
                double yc[],
                double ys[],
                int npts,
                int64_t *time0,
                int64_t *time1,
                int mapno)
{
/***************************************************
 * Declarations
 ***************************************************/
OBM_BANK_A (ysmap, double, MAX_OBM_SIZE)
OBM_BANK_B (a, double, MAX_OBM_SIZE)
OBM_BANK_C (b, double, MAX_OBM_SIZE)
OBM_BANK_D (c, double, MAX_OBM_SIZE)
OBM_BANK_E (x, double, MAX_OBM_SIZE)
OBM_BANK_F (y, double, MAX_OBM_SIZE)
int i,j, nbytes, indx;
int64_t tm0,tm1;
double varx, indxtmp;

/***************************************************
 * Read in the coeff and segment endpoints
 ***************************************************/
nbytes = 4*ncoef * 8; /* 4 data values (seg,a,b,c), 64bits each */
DMA_CPU (CM2OBM, ysmap, MAP_OBM_stripe(1,"A,B,C,D"), ac, 1, nbytes, 0);
wait_DMA (0);

/***************************************************
 * Read in the Number of points
 ***************************************************/
nbytes = npts * 8;
DMA_CPU (CM2OBM, x, MAP_OBM_stripe(1,"E"), xc, 1, nbytes, 0);
wait_DMA (0);

/***************************************************
 * Useful in Debug Mode to determine when in Map
 ***************************************************/
printf ("\n\n********** NOW IN MAP **********\n");
printf ("MAP subr ncoef %i npts %i\n",ncoef,npts);

/***************************************************
 * Read timer and use a constant for UNIFORM Segmentation
 ***************************************************/
read_timer (&tm0);
printf("incre: %15.10lf offset: %15.10lf\n", incre, offsetV);
for (i=0;i<npts;i++)
{
    varx = x[i];
    indxtmp = incre * varx;
    indx = (int) (indxtmp-offsetV); // For interval [a,b]; when
    y[i] = a[indx]*varx*varx + varx*b[indx] + c[indx];
    // For Debug only
    printf ("indxtmp: %15.10lf indx: %i x: %15.10lf a: %15.10lf ",
            indxtmp, indx, varx, a[indx]);
    printf("b: %15.10lf c: %15.10lf fx: %15.10lf\n",
            b[indx], c[indx], y[i]);
}
read_timer (&tm1);
*time0 = tm1-tm0;
read_timer (&tm0);
if (ncoef == 4017) {
    for (i=0; i<npts; i++)
        ysmap[i] = sqrt (-1*logf(x[i])); // func 10
        // ysmap[i] = cosf(x[i]*3.14159265358979); // func 8
}
read_timer (&tm1);
*time1 = tm1 - tm0;

/***************************************************
 * Send back the results
 ***************************************************/
nbytes = npts * 8;
DMA_CPU (OBM2CM, y, MAP_OBM_stripe(1,"F"), yc, 1, nbytes, 0);
wait_DMA (0);

nbytes = npts * 8;
DMA_CPU (OBM2CM, ysmap, MAP_OBM_stripe(1,"A"), ys, 1, nbytes, 0);
wait_DMA (0);
}
c. Sample memory file (memD13.mem)
23 
23.000000000000000000 
0.000000000000000000 
0.043477043477043474 -0.001358317312431898 0.250022142664697360 0.499999946525873650 
0.086956086956086961 -0.004069869528842258 0.250257856231724860 0.499994717163829820
0.130434130434130440 -0.006766102264296870 0.250726624819223480 0.499974235918064340 
0.173912173912173920 -0.009436856238895420 0.251423133697143810 0.499928719944731090 
0.217390217390217380 -0.012072268340317297 0.252339521794440860 0.499848954352031030 
0.260869260869260880 -0.014662753872468844 0.253465478604624370 0.499726502843754640 
0.304347304347304340 -0.017199021635054011 0.254788348118468570 0.499553907045252270 
0.347825347825347850 -0.019672204841648999 0.256293304836314910 0.499324864391809510 
0.391303391303391310 -0.022073970679148382 0.257963583077645050 0.499034376288904240 
0.434782434782434780 -0.024396553731653503 0.259780689724019740 0.498678874917245770 
0.478260478260478240 -0.026632698067803585 0.261724551035193380 0.498256341372862010 
0.521738521738521750 -0.028775795055367724 0.263773815334481300 0.497766372051921200 
0.565216565216565270 -0.030819956163696906 0.265906159795873900 0.497210208629803470 
0.608695608695608680 -0.032759983918098333 0.268098508370670290 0.496590760719450740 
0.652173652173652200 -0.034591349895399928 0.270327244592241440 0.495912606692870410 
0.695651695651695600 -0.036310255047355668 0.272568518769947920 0.495181941782176340 
0.739129739129739120 -0.037913669840421889 0.274798561824121660 0.494406487986867260 
0.782608782608782640 -0.039399282695242094 0.276993876874879700 0.493595416118664640 
0.826086826086826040 -0.040765458997357298 0.279131424123494290 0.492759249956049420 
0.869564869564869560 -0.042011264516131290 0.281188891968362940 0.491909715721972900
0.913042913042913070 -0.043136460147426350 0.283144934523894110 0.491059574388106770 
0.956521956521956480 -0.044141446445428945 0.284979312245647760 0.490222473401396690 
1.000000000000000000 -0.045027205024233290 0.286673001843708750 0.489412798087615010
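
The way one row of this memory file is selected and used can be sketched in plain C. The sketch below is illustrative only and makes assumptions that are not in the listings: the names segment_t, sample_segments, uniform_index and eval_uniform are hypothetical, only the first two rows of the file are copied in, and the index is clamped to the table, which the MAP code does not do. The header values 23 (segment count), 23.0 (incre) and 0.0 (offset) are taken from the file above, and each row is read as endpoint, a, b, c.

```c
/* One row of the memory file: segment endpoint and quadratic coefficients. */
typedef struct { double endpoint, a, b, c; } segment_t;

/* First two rows of the sample file, copied for illustration. */
static const segment_t sample_segments[2] = {
    { 0.043477043477043474, -0.001358317312431898,
      0.250022142664697360,  0.499999946525873650 },
    { 0.086956086956086961, -0.004069869528842258,
      0.250257856231724860,  0.499994717163829820 },
};

/* Uniform segmentation: the segment number is a scaled, offset copy of x.
 * Clamping to [0, nseg-1] is an added safety assumption. */
static int uniform_index(double incre, double offset, double x, int nseg)
{
    int indx = (int)(incre * x - offset);
    if (indx < 0) indx = 0;
    if (indx >= nseg) indx = nseg - 1;
    return indx;
}

/* Evaluate a*x^2 + b*x + c with the coefficients of the selected segment. */
static double eval_uniform(const segment_t *seg, int nseg,
                           double incre, double offset, double x)
{
    int indx = uniform_index(incre, offset, x, nseg);
    return seg[indx].a * x * x + seg[indx].b * x + seg[indx].c;
}
```

With these two rows, eval_uniform(sample_segments, 2, 23.0, 0.0, 0.02) uses row 0 and gives roughly 0.50500, which agrees with 1/(1+e^(-x)) (function 13) at x = 0.02 to about seven decimal places.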
2. Fixed Point 
a. Main.c 
#include<stdio.h> 
#include<stdlib.h> 
#include<strings.h> 
#include<libmap.h> 
#include<math.h> 
// Subroutine initialization in Main
void subr_map (int64_t acoef[],
               int ncoef,
               int64_t incre,
               int64_t offsetV,
               int64_t x[],
               int64_t y[],
               int xpts,
               int64_t *time0,
               int mapnum);

// MAIN
main () {

// Initialize Variables
FILE *fp1;
int i,ir,nc,xpts, inNum;
int mapnum, nmap,ncoef;
int arr_indx;
int64_t *arraym, *xm, *ym, incre, offsetV;
int64_t tm0,tm1,hexval;
char hexstr[80], *token, *stpstr, strDelimit[]=" \n";

// Starting NFG
printf ("\n***START UP THE NFG ***\n");
mapnum = 0;
nmap = 1;

// User interface
printf (" \n");
printf ("Function 1.  2^x : 1\n");
printf ("Function 2.  1/x : 2\n");
printf ("Function 3.  sqrt(x) : 3\n");
printf ("Function 4.  1/sqrt(x) : 4\n");
printf ("Function 5.  log2(x) : 5\n");
printf ("Function 6.  ln(x) : 6\n");
printf ("Function 7.  sin(pi*x) : 7\n");
printf ("Function 8.  cos(pi*x) : 8\n");
printf ("Function 9.  tan(pi*x) : 9\n");
printf ("Function 10. sqrt(-ln(x)) : 10\n");
printf ("Function 11. tan(pi*x)^2 + 1 : 11\n");
printf ("Function 12. -(x*log2(x) + (1-x)*log2(1-x)) : 12\n");
printf ("Function 13. 1/(1+e^(-x)) : 13\n");
printf ("Function 14. (1/sqrt(2*pi))*exp(-x^2/2) : 14\n");
printf ("Function 15. sin(exp(x)) : 15\n");
printf (" \n");
//inNum = 1; // dummy default value
printf ("\nSelect which function to implement: ");
scanf ("%i", &inNum);
printf ("What value did I enter: %i \n ",inNum);

// Open the Hex data file to read
switch (inNum)
{
case 1:  fp1 = fopen("Data/memH1.mem","r");
         break;
case 2:  fp1 = fopen("Data/memH2.mem","r");
         break;
case 3:  fp1 = fopen("Data/memH3.mem","r");
         break;
case 4:  fp1 = fopen("Data/memH4.mem","r");
         break;
case 5:  fp1 = fopen("Data/memH5.mem","r");
         break;
case 6:  fp1 = fopen("Data/memH6.mem","r");
         break;
case 7:  fp1 = fopen("Data/memH7.mem","r");
         break;
case 8:  fp1 = fopen("Data/memH8.mem","r");
         break;
case 9:  fp1 = fopen("Data/memH9.mem","r");
         break;
case 10: fp1 = fopen("Data/memH10.mem","r");
         break;
case 11: fp1 = fopen("Data/memH11.mem","r");
         break;
case 12: fp1 = fopen("Data/memH12.mem","r");
         break;
case 13: fp1 = fopen("Data/memH13.mem","r");
         break;
case 14: fp1 = fopen("Data/memH14.mem","r");
         break;
default: fp1 = fopen("Data/memH15.mem","r");
         break;
}

printf ("fp1 %i\n", fp1);

// ! allocate map to this problem
map_allocate (nmap);

// Read in the number of segments (decimal #)
fscanf (fp1, "%i", &ncoef);
fscanf (fp1, "%llx", &incre);
fscanf (fp1, "%llx", &offsetV);
printf ("ncoef: %3i incre: %8llx\n",ncoef,incre);

// Accommodate lots of results
nc = 30000;

// array is enough room to hold 4 64 bit data pieces
// Perform cache alignment
arraym = (int64_t *)Cache_Aligned_Allocate (4*ncoef*8);
xm = (int64_t *)Cache_Aligned_Allocate (nc*8 );
ym = (int64_t *)Cache_Aligned_Allocate (nc*8 );

// Get rid of first npc
fgets (hexstr, sizeof hexstr, fp1);

// Read all endpoints and coefficients into OBM banks
for (i=0;i<ncoef;i++) {
    fgets (hexstr, sizeof hexstr, fp1);
    token = strtok(hexstr,strDelimit);
    sscanf (token, "%llx", &hexval);
    arraym[i*4] = hexval;
    token = strtok (NULL, strDelimit);
    sscanf (token, "%llx", &hexval);
    arraym[i*4+1] = hexval;
    token = strtok (NULL, strDelimit);
    sscanf (token, "%llx", &hexval);
    arraym[i*4+2] = hexval;
    token = strtok (NULL, strDelimit);
    sscanf (token, "%llx", &hexval);
    arraym[i*4+3] = hexval;
}

// close the file
fclose(fp1);

// create some values to test with
xpts = 100;
for (ir=0;ir<xpts;ir++) {
    arr_indx = ir % ncoef;
    xm[ir] = arraym[arr_indx*4]; // Optional -0x2061d;
    printf ("arr_indx = %3i xm[%2i]= %10llx\n",
            arr_indx, ir, xm[ir]);
}
printf ("Right Before MAP *** \nmain ncoef %i xpts %i\n",
        ncoef, xpts);

subr_map (arraym, ncoef,incre,offsetV, xm, ym, xpts, &tm0,mapnum);

printf ("\n************Back from the MAP!!! ***********\n");
printf ("\n********** SHIFT8 ***************\n");
printf ("%lld clocks\n", tm0);
for (i=0;i<xpts;i++) {
    printf ("i: %3i x: %8llx fx: %10llx\n",i,xm[i],ym[i] );
}

printf ("%lld clocks\n", tm1);
// ! release the map resources
map_free (nmap);
}
b. subr.mc 
#include <libmap.h> 
void subr_map (int64_t ac[],
               int ncoef,
               int64_t incre,
               int64_t offsetV,
               int64_t xc[],
               int64_t yc[],
               int xpts,
               int64_t *time0,
               int mapno) {

/***************************************************
 * Declarations
 ***************************************************/
OBM_BANK_A (segend, int64_t, MAX_OBM_SIZE)
OBM_BANK_B (a, int64_t, MAX_OBM_SIZE)
OBM_BANK_C (b, int64_t, MAX_OBM_SIZE)
OBM_BANK_D (c, int64_t, MAX_OBM_SIZE)
OBM_BANK_E (x, int64_t, MAX_OBM_SIZE)
OBM_BANK_F (y, int64_t, MAX_OBM_SIZE)
int i,j, nbytes;
int64_t tm0,tm1,varx,varsq, vara, varb, varc,ax2,bx1,fx;
int64_t varxtmp, indx;

/***************************************************
 * Read into OBM. Coeff & segment endpoints
 ***************************************************/
// 4 data values (seg,a,b,c), 64bit Hex values
nbytes = 4*ncoef * 8;
DMA_CPU (CM2OBM, segend, MAP_OBM_stripe(1,"A,B,C,D"),ac,1,nbytes, 0);
wait_DMA (0);
// Read in the Number of points
nbytes = xpts * 8;
DMA_CPU (CM2OBM, x, MAP_OBM_stripe(1,"E"), xc, 1, nbytes, 0);
wait_DMA (0);
// DEBUG: determine when in Map
printf ("\n\n********** NOW IN MAP **********\n");
printf ("MAP subr ncoef %i xpts %i \n",ncoef,xpts);
/***************************************************
 * Read timer and use selector to determine the segment
 ***************************************************/
read_timer (&tm0);
incre >>= 16;   // asr to open integer bits
offsetV >>= 16; // asr to match in subtraction
for (i=0;i<xpts;i++)
{
    varx = x[i];            // Take from OBM put in BRAM
    indx = varx * incre;    // Segment index Number * x input
    indx >>= 32;            // Return to 16 fraction points
    indx = indx - offsetV;  // Adjust index to interval start
    indx >>= 16;            // remove fraction
    vara = a[(int) indx];   // Move from OBM into BRAM
    varb = b[(int) indx];
    varc = c[(int) indx];
    // ---- Square X and shift ----//
    varx >>= 8;             // Remove lower 8 bits, 40.24
    varsq = varx*varx;      // Now we have 80.48 --> 16.48
    varsq >>= 24;           // SRL eliminate 40.24
    if (varx < 0x8000000000000000) // if varx is positive
        varsq = varsq & 0x000000FFFFFFFFFF; // bitwise AND; 24bits

    // --- X^2 * first Coefficient -//
    vara >>= 8;             // remove lower 8 bits, 40.24
    ax2 = varsq*vara;       // a[indx];
    ax2 >>= 16;             // Want 32.32, so srl 16
    if (vara < 0x8000000000000000) // if both +ve
        ax2 = ax2 & 0x0000FFFFFFFFFFFF; // bitwise AND; 16bits

    // --- X * second Coefficient --//
    varb >>= 8;             // Remove lower 8 bits, 40.24
    bx1 = varx*varb;        // both are already shifted
    bx1 >>= 16;             // Return to 32.32 (int.fract)
    if (varb < 0x8000000000000000) // if both +ve
        bx1 = bx1 & 0x0000FFFFFFFFFFFF; // bitwise AND; 16bits
    // -- 3 input add to complete --//
    y[i] = ax2+bx1+varc;    // no need to shift varc
    // DEBUG
    // printf ("indx: %4llx -> %4lli varx: %6llx incre: %6llx\n",
    //         indx, (int) indx, varx, incre);
}

// Time it took to compute
read_timer (&tm1);
*time0 = tm1-tm0;

/***************************************************
 * Send back the results
 ***************************************************/
nbytes = xpts * 8;
DMA_CPU (OBM2CM, y, MAP_OBM_stripe(1,"F"), yc, 1, nbytes, 0);
wait_DMA (0);
}
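
The shift bookkeeping in the fixed-point loop is easier to follow in isolation. The following is a minimal sketch under stated assumptions: quad_fixed is a hypothetical name, all operands are non-negative 32.32 fixed-point values held in int64_t, and the sign-dependent masking of the MAP code is left out. Each operand is trimmed to 24 fraction bits before a multiply so that the 48-fraction-bit product fits in 64 bits, and the product is then shifted back to the target format.

```c
#include <stdint.h>

/* Evaluate a*x^2 + b*x + c in 32.32 fixed point (non-negative inputs only).
 * Hypothetical helper mirroring the shift sequence of the listing. */
static int64_t quad_fixed(int64_t x, int64_t a, int64_t b, int64_t c)
{
    int64_t x24 = x >> 8;                 /* 32.32 -> 40.24              */
    int64_t sq  = (x24 * x24) >> 24;      /* 48 fraction bits -> 24      */
    int64_t ax2 = (sq * (a >> 8)) >> 16;  /* 48 fraction bits -> 32      */
    int64_t bx  = (x24 * (b >> 8)) >> 16; /* 48 fraction bits -> 32      */
    return ax2 + bx + c;                  /* c is already in 32.32       */
}
```

For x = 1.5, a = 0.5, b = 2.0 and c = 0.25 (0x180000000, 0x80000000, 0x200000000 and 0x40000000), the function returns 0x460000000, which is 4.375 in 32.32 format.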
C.2. NON-UNIFORM SEGMENTATION
1. Floating Point

a. Main.c

#include<stdio.h>
#include<stdlib.h>
#include<strings.h>
#include<libmap.h>

// Subroutine initialization in Main
void subr_map( double acoef[],
               int ncoef,
               double x[],
               double y[],
               double ys[],
               int npts,
               int64_t *time0,
               int64_t *time1,
               int mapnum);

// MAIN
main () {

// Initialize Variables
FILE *fp1;
double *array, *x, *y, *ys;
double val;
int i,ir,nc,npts, mapnum,nmap, ncoef,arr_indx;
int64_t tm0, tm1;

printf ("\n***START UP THE NFG ***\n");

// select map number
mapnum = 0;
nmap = 1;
// ! allocate map to this problem
map_allocate (nmap);

// Depending on number segments
//nc = 50;   // For 16 bit accuracy
nc = 200;    // For 23 bits
//nc = 1500; // For 32 bits
//nc = 5000; // For 42 bits

array = (double*) Cache_Aligned_All]l 
x = (double*) Cache_Aligned_All 
y = (double*) Cache_Aligned_Al 
ys = (double*) Cache_Aligned_Al 
fp1 = fopen ("Data/memDEC.mem","r");

fscanf (fp1, "%i", &ncoef);
// check if the right thing was read
printf (" ncoef %i\n",ncoef);





1’ 


ocate 
ocate 


locate 
locate 
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/* 


AY 


// 


// read_file
for (i=0;i<ncoef;i++) {
    fscanf (fp1, "%lf", &val);
    array[i*4] = val;

    fscanf (fp1, "%lf", &val);
    array[i*4+1] = val;

    fscanf (fp1, "%lf", &val);
    array[i*4+2] = val;

    fscanf (fp1, "%lf", &val);
    array[i*4+3] = val;
} // end read_file

// print_array
for (i=0;i<ncoef;i++) {
    printf (" endpt %10.6f a %10.6f b %10.6f c %10.6f\n",
            array[4*i+0],
            array[4*i+1],
            array[4*i+2],
            array[4*i+3]);
} // end print_array

npts = 100;

// create_samples
for (ir=0;ir<npts;ir++) {
    arr_indx = ir % ncoef;
    x[ir] = array[arr_indx*4];
    printf ("ir %3i x_values are: %lf\n",ir,x[ir]);
} // end create_samples

printf ("main ncoef %i npts %i\n",ncoef,npts);

subr_map (array, ncoef, x, y, ys, npts, &tm0, &tm1, mapnum);

printf ("\n********** BACK FROM MAP **********\n");
printf ("%lld clocks\n", tm0);
printf ("%lld clocks\n", tm1);

for (i=0;i<npts;i++) {
    printf ("x: %5.18lf ysubr: %5.18lf SRCMacro2^x: %5.18f\n",
            x[i], y[i], ys[i]);
    printf ("x: %5.18f ysubr: %5.18f\n",x[i],y[i]);
}

// ! release the map resources
map_free (nmap);
}
b. subr.mc

#include <libmap.h>

void subr_map ( double ac[],
                int ncoef,
                double xc[],
                double yc[],
                double ys[],
                int npts,
                int64_t *time0,
                int64_t *time1,
                int mapno)
{
/***************************************************
 * Declarations
 ***************************************************/
OBM_BANK_A (ysmap, double, MAX_OBM_SIZE)
OBM_BANK_B (a, double, MAX_OBM_SIZE)
OBM_BANK_C (b, double, MAX_OBM_SIZE)
OBM_BANK_D (c, double, MAX_OBM_SIZE)
OBM_BANK_E (x, double, MAX_OBM_SIZE)
OBM_BANK_F (y, double, MAX_OBM_SIZE)
int i,j, nbytes, indx, sel;
int64_t tm0,tm1;
double varx;

/***************************************************
 * Read in the coeff and segment endpoints
 ***************************************************/
nbytes = 4*ncoef * 8; /* 4 data values (seg,a,b,c), 64bits each */
DMA_CPU (CM2OBM, ysmap, MAP_OBM_stripe(1,"A,B,C,D"), ac, 1, nbytes, 0);
wait_DMA (0);

/***************************************************
 * Read in the Number of points
 ***************************************************/
nbytes = npts * 8;
DMA_CPU (CM2OBM, x, MAP_OBM_stripe(1,"E"), xc, 1, nbytes, 0);
wait_DMA (0);

/***************************************************
 * Useful in Debug Mode to determine when in Map
 ***************************************************/
printf ("\n\n********** NOW IN MAP **********\n");
printf ("MAP subr ncoef %i npts %i\n",ncoef,npts);

/***************************************************
 * Read timer and use a constant for UNIFORM Segmentation
 ***************************************************/
read_timer (&tm0);
for (i=0; i<npts; i++)
{
    varx = x[i];
    if ( varx <= 1.010456600772177400)
        sel = 1;
    else if ( varx <= 1.254138569173091300)
        sel = 2;
    else if ( varx <= 1.393018722518969900)
        sel = 3;
    else if ( varx <= 1.414213562373095100)
        sel = 4;
    switch (sel)
    {
    case 1:
        select_pri_64bit_32val( varx <= 0.065896761049097793, 0,
                                varx <= 0.113411555832503830, 1,
                                varx <= 0.155068672182882060, 2,
                                varx <= 0.193392483833442240, 3,
                                varx <= 0.229466279456250750, 4,
                                varx <= 0.263888271986404410, 5,
                                varx <= 0.297033228392699020, 6,
                                varx <= 0.329159950015850300, 7,
                                varx <= 0.360453699017791120, 8,
                                varx <= 0.391055896896180420, 9,
                                varx <= 0.421076852419192020, 10,
                                varx <= 0.450608489560304140, 11,
                                varx <= 0.479725761713275970, 12,
                                varx <= 0.508490894337077280, 13,
                                varx <= 0.536959041815795230, 14,
                                varx <= 0.565178287458633520, 15,
                                varx <= 0.593191057714890110, 16,
                                varx <= 0.621035536388932610, 17,
                                varx <= 0.648747078855175910, 18,
                                varx <= 0.676359626273058010, 19,
                                varx <= 0.703904291372063890, 20,
                                varx <= 0.731409358451725390, 21,
                                varx <= 0.758903111811573990, 22,
                                varx <= 0.786413835751141770, 23,
                                varx <= 0.813969814569960310, 24,
                                varx <= 0.841596504137608230, 25,
                                varx <= 0.869322188753617440, 26,
                                varx <= 0.897175152717519460, 27,
                                varx <= 0.925183680328846240, 28,
                                varx <= 0.953378884317082620, 29,
                                varx <= 0.981791877411713590, 30,
                                31, &indx);
        break;
    case 2:
        select_pri_64bit_8val ( varx <= 1.039409823987864900, 32,
                                varx <= 1.068692559293097600, 33,
                                varx <= 1.098347233137172600, 34,
                                varx <= 1.128421928829294700, 35,
                                varx <= 1.158970386538573600, 36,
                                varx <= 1.190056245938955900, 37,
                                varx <= 1.221750217779271400, 38,
                                39, &indx);
        break;
    case 3:
        select_pri_64bit_4val ( varx <= 1.287320295168777200, 40,
                                varx <= 1.321418432469292400, 41,
                                varx <= 1.356585716292107800, 42,
                                43, &indx);
        break;
    case 4:
        select_pri_64bit_4val ( varx <= 1.414213562373095100, 44,
                                varx <= 1.414213562373095100, 44,
                                varx <= 1.414213562373095100, 44,
                                44, &indx);
        break;
    }
    y[i] = a[indx]*varx*varx + varx*b[indx] + c[indx];
    // printf ("i %3i a %f b %f c %f x %20.18f y %20.18f\n",
    //         indx,a[indx],b[indx],c[indx],varx,y[i]);
}

read_timer (&tm1);
*time0 = tm1-tm0;

read_timer (&tm0);
// Function 1
for (i=0; i<npts; i++)
    ysmap[i] = (1/sqrtf(2*3.14159265258979))*powf(2.71828182845905, -0.5*powf(x[i],2)); // func 14
    //ysmap[i] = powf(2,x[i]); // func 1
    //ysmap[i] = 1/x[i]; // func 2
    //ysmap[i] = sqrtf(x[i]); // func 3
    //ysmap[i] = 1/sqrtf(x[i]); // func 4
    //ysmap[i] = logf(x[i])/0.693147180559945; // func 5
    //ysmap[i] = logf(x[i]); // func 6
    //ysmap[i] = sinf(x[i]*3.14159265258979); // func 7
    //ysmap[i] = cosf(x[i]*3.14159265258979); // func 8
    //ysmap[i] = tanf(x[i]*3.14159265258979); // func 9
    //ysmap[i] = sqrt(-1*logf(x[i])); // func 10
    //ysmap[i] = powf(tanf(x[i]*3.14159265258979),2); // func 11
    //ysmap[i] = -(x[i]*logf(x[i])/0.69314718055994 +(1-x[i])*logf(1-x[i])/0.69314718055994 );// func 12
    //ysmap[i] = 1/(1+powf(0.693147180559945,(-1*x[i]))); // func 13
    //ysmap[i] = (1/sqrtf(2*3.14159265258979))*powf(2.71828182845905, -0.5*powf(x[i],2)); // func 14
    //ysmap[i] = sinf(powf(2.71828182845905,x[i])); // func 15
read_timer (&tm1);

*time1 = tm1 - tm0;

/***************************************************
 * Send back the results
 ***************************************************/
nbytes = npts * 8;
DMA_CPU (OBM2CM, y, MAP_OBM_stripe(1,"F"), yc, 1, nbytes, 0);
wait_DMA (0);

nbytes = npts * 8;
DMA_CPU (OBM2CM, ysmap, MAP_OBM_stripe(1,"A"), ys, 1, nbytes, 0);
wait_DMA (0);
}
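
The priority selection used for non-uniform segments amounts to finding the first segment whose upper endpoint is not below x. A minimal C sketch of that idea follows; it is not the SRC macro itself. The names endpoints_group3 and select_segment are hypothetical, the linear scan stands in for the parallel comparison done in hardware, and the table holds only the three endpoints of group 3 above plus that group's upper bound.

```c
/* Upper endpoints of the four segments in group 3 of the listing
 * (segment numbers 40 to 43); the last entry is the group's upper bound. */
static const double endpoints_group3[4] = {
    1.287320295168777200,
    1.321418432469292400,
    1.356585716292107800,
    1.393018722518969900,
};

/* Return the position of the first endpoint with x <= endpoint.
 * If x exceeds every tested endpoint, the last position is returned,
 * matching the default value of a priority select. */
static int select_segment(const double *endpoints, int n, double x)
{
    int i;
    for (i = 0; i < n - 1; i++) {
        if (x <= endpoints[i])
            return i;
    }
    return n - 1;
}
```

For x = 1.30, select_segment(endpoints_group3, 4, 1.30) returns 1; adding the group's base of 40 gives segment 41, the same index the third case of the switch would produce.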











2. Fixed Point 


a. Main.c 








#include<stdio.h> 
#include<stdlib.h> 
#include<strings.h> 
#include<libmap.h> 
#include<math.h> 





// Subroutine initialization in Main
void subr_map (int64_t acoef[],
               int     ncoef,
               int64_t x[],
               int64_t y[],
               int     xpts,
               int64_t *time0,
               int     mapnum);


// MAIN 
main () { 


// Initialize Variables 





FILE *fpl;
int i,ir,nc,xpts;
int mapnum, nmap,ncoef;
int arr_indx;
int64_t *arraym, *xm, *ym;
int64_t tm0,tm1,hexval;
char hexstr[80], *token, *stpstr, strDelimit[]=" \n";

// Starting NFG
printf ("\n***START UP THE NFG ***\n");
mapnum = 0;
nmap = 1;
/*
*/

// ! allocate map to this problem
map_allocate (nmap);
nc = 300;





// array has enough room to hold 4 64-bit data pieces per segment
// Perform cache alignment
arraym = (int64_t *)Cache_Aligned_Allocate (4*nc*8);
xm = (int64_t *)Cache_Aligned_Allocate (nc*8);
ym = (int64_t *)Cache_Aligned_Allocate (nc*8);











// Open the Hex data file to read
fpl = fopen ("Data/memHEXOx.mem","r");
printf ("fpl %i\n",fpl);

// Read in the number of segments (decimal #)
fscanf (fpl, "%i", &ncoef);
printf (" ncoef %i\n",ncoef);

// Discard the remainder of the first line
fgets (hexstr, sizeof hexstr, fpl);

// Read all endpoints and coefficients into OBM banks
for (i=0;i<ncoef;i++) {
    fgets (hexstr, sizeof hexstr, fpl);
    token = strtok (hexstr, strDelimit);
    sscanf (token, "%llx", &hexval);
    arraym[i*4] = hexval;
    token = strtok (NULL, strDelimit);
    sscanf (token, "%llx", &hexval);
    arraym[i*4+1] = hexval;
    token = strtok (NULL, strDelimit);
    sscanf (token, "%llx", &hexval);
    arraym[i*4+2] = hexval;
    token = strtok (NULL, strDelimit);
    sscanf (token, "%llx", &hexval);
    arraym[i*4+3] = hexval;
}

fclose(fpl);





// print out the contents of the array, first 30 elements only
for (i=0;i<30;i++) {
    printf ("endpoint: %llx a: %llx b: %llx c: %llx \n",
            arraym[i*4],arraym[i*4+1],arraym[i*4+2],arraym[i*4+3]);
}





// create some values to test with
xpts = 30;

for (ir=0;ir<xpts;ir++) {
    //arr_indx = (int) fabs (remainder (ir,20));
    arr_indx = ir % ncoef;
    xm[ir] = arraym[arr_indx*4];//+0xa0000000;
    printf ("arr_indx = %d xm[%d]= %llx\n",arr_indx,ir,xm[ir]);
}

printf ("Right Before MAP elias \nmain ncoef %i xpts %i\n",ncoef,xpts);

subr_map (arraym, ncoef, xm, ym, xpts, &tm0, mapnum);

printf ("\n********** Back from the MAP **********\n");
printf ("%lld clocks\n", tm0);

for (i=0;i<xpts;i++) {
    printf ("i: %3d x values: %16llx y values: %16llx \n",
            i, xm[i], ym[i]);
}


// ! release the map resources
map_free (nmap);
}

















b. subr.mc 
#include <libmap.h> 
void subr_map (int64_t ac[],
               int     ncoef,
               int64_t xc[],
               int64_t yc[],
               int     xpts,
               int64_t *time0,
               int     mapno) {


/*********************************************
 * Declarations
 *********************************************/
OBM_BANK_A (segend, int64_t, MAX_OBM_SIZE)
OBM_BANK_B (a, int64_t, MAX_OBM_SIZE)
OBM_BANK_C (b, int64_t, MAX_OBM_SIZE)
OBM_BANK_D (c, int64_t, MAX_OBM_SIZE)
OBM_BANK_E (x, int64_t, MAX_OBM_SIZE)
OBM_BANK_F (y, int64_t, MAX_OBM_SIZE)

int i,j, nbytes, sel;
int64_t tm0,tm1, indx, varx, varsq, vara, varb,varc, ax2,bx1, fx;








/*********************************************
 * Read into OBM. Coeff & segment endpoints
 *********************************************/
// 4 data values (seg,a,b,c), 64bit Hex values
nbytes = 4*ncoef * 8;
DMA_CPU (CM2OBM, segend, MAP_OBM_stripe(1,"A,B,C,D"), ac, 1,
         nbytes, 0);
wait_DMA (0);

// Read in the Number of points
nbytes = xpts * 8;
DMA_CPU (CM2OBM, x, MAP_OBM_stripe(1,"E"), xc, 1, nbytes, 0);
wait_DMA (0);





// DEBUG: determine when in Map
printf ("\n\n********** NOW IN MAP **********\n");
printf ("MAP subr ncoef %i xpts %i \n",ncoef,xpts);

/*********************************************
 * Read timer and use selector to determine the segment
 *********************************************/
read_timer (&tm0);

for (i=0;i<xpts;i++)
{
    varx = x[i];

    if ( varx <= 0x000000001816a7a6)
        sel = 1;
    else if ( varx <= 0x000000003b3b34a8)
        sel = 2;
    else if ( varx <= 0x0000000040000000)
        sel = 3;

    switch (sel)
    {
      case 1:
select_pri_64bit_128val( varx <= 0x0000000000841cdf, 0,
varx <= 0x0000000000885b08, 1,
varx <= 0x00000000008cbea6, 2,
varx <= 0x000000000091438e, 3,
varx <= 0x000000000095edeb, 4,
varx <= 0x00000000009abdbc, 5,
varx <= 0x00000000009fb301, 6,
varx <= 0x0000000000a4d1e3, 7,
varx <= 0x0000000000aa1a64, 8,
varx <= 0x0000000000af8c81, 9,
varx <= 0x0000000000b5283d, 10,
varx <= 0x0000000000baf1bf, 11,
varx <= 0x0000000000c0e908, 12,
varx <= 0x0000000000c71241, 13,
varx <= 0x0000000000cd6d6a, 14,
varx <= 0x0000000000d3fa84, 15,
varx <= 0x0000000000dabdb8, 16,
varx <= 0x0000000000e1b705, 17,
varx <= 0x0000000000e8e66b, 18,
varx <= 0x0000000000f05015, 19,
varx <= 0x0000000000f7f401, 20,
varx <= 0x0000000000ffd65a, 21,
varx <= 0x000000000107f71f, 22,
varx <= 0x0000000001105651, 23,
varx <= 0x000000000118f818, 24,
varx <= 0x000000000121e09e, 25,
varx <= 0x00000000012b0fe3, 26,
varx <= 0x00000000013485e7, 27,
varx <= 0x00000000013e46d4, 28,
varx <= 0x00000000014856d2, 29,
varx <= 0x000000000152b5e2, 30,
varx <= 0x00000000015d6404, 31,
varx <= 0x000000000168698a, 32,
varx <= 0x000000000173c675, 33,
varx <= 0x00000000017f7ac4, 34,
varx <= 0x00000000018b8aa1, 35,
varx <= 0x000000000197fa35, 36,
varx <= 0x0000000001a4cda9, 37,
varx <= 0x0000000001b204fe, 38,
varx <= 0x0000000001bfa45d, 39,
varx <= 0x0000000001cdabc6, 40,
varx <= 0x0000000001dc238b, 41,
varx <= 0x0000000001eb0bad, 42,
varx <= 0x0000000001fa6855, 43,
varx <= 0x00000000020a3dac, 44,
varx <= 0x00000000021a8fdc, 45,
varx <= 0x00000000022b5ee4, 46,
varx <= 0x00000000023cb318, 47,
varx <= 0x00000000024e8c77, 48,
varx <= 0x000000000260ef2a, 49,
varx <= 0x000000000273e386, 50,
varx <= 0x0000000002876988, 51,
varx <= 0x00000000029b8985, 52,
varx <= 0x0000000002b0437b, 53,
varx <= 0x0000000002c59fbf, 54,
varx <= 0x0000000002db9e4f, 55,
varx <= 0x0000000002f2477f, 56,
varx <= 0x0000000003099f78, 57,
varx <= 0x000000000321ae8c, 58,
varx <= 0x00000000033a74bc, 59,
varx <= 0x000000000353fa5a, 60,
varx <= 0x00000000036e4390, 61,
varx <= 0x00000000038958b0, 62,
varx <= 0x0000000003a539bb, 63,
varx <= 0x0000000003c1f32c, 64,
varx <= 0x0000000003df892d, 65,
varx <= 0x0000000003fdffe7, 66,
varx <= 0x00000000041d5fac, 67,
varx <= 0x00000000043db0d1, 68,
varx <= 0x00000000045efba6, 69,
varx <= 0x0000000004814457, 70,
varx <= 0x0000000004a48f0b, 71,
varx <= 0x0000000004c8e83f, 72,
varx <= 0x0000000004ee541c, 73,
varx <= 0x000000000514df1f, 74,
varx <= 0x00000000053c8d71, 75,
varx <= 0x0000000005656765, 76,
varx <= 0x00000000058f7975, 77,
varx <= 0x0000000005bac7cd, 78,
varx <= 0x0000000005e75ee8, 79,
varx <= 0x0000000006154718, 80,
varx <= 0x00000000064488b1, 81,
varx <= 0x000000000675302f, 82,
varx <= 0x0000000006a745e3, 83,
varx <= 0x0000000006dad222, 84,
varx <= 0x00000000070fe58f, 85,
varx <= 0x0000000007468456, 86,
varx <= 0x00000000077ebf1a, 87,
varx <= 0x0000000007b89e30, 88,
varx <= 0x0000000007f42e12, 89,
varx <= 0x0000000008317b3d, 90,
varx <= 0x000000000870922d, 91,
varx <= 0x0000000008b17f5e, 92,
varx <= 0x0000000008f45375, 93,
varx <= 0x00000000093916c6, 94,
varx <= 0x00000000097fd9f5, 95,
varx <= 0x0000000009c8a97e, 96,
varx <= 0x000000000a139609, 97,
varx <= 0x000000000a60b038, 98,
varx <= 0x000000000ab00489, 99,
varx <= 0x000000000b01a3a1, 100,
varx <= 0x000000000b559e25, 101,
varx <= 0x000000000bac04bb, 102,
varx <= 0x000000000c04e808, 103,
varx <= 0x000000000c605cdb, 104,
varx <= 0x000000000cbe6fb0, 105,
varx <= 0x000000000d1f3980, 106,
varx <= 0x000000000d82c6c5, 107,
varx <= 0x000000000de93079, 108,
varx <= 0x000000000e528741, 109,
varx <= 0x000000000ebedfeb, 110,
varx <= 0x000000000f2e5370, 111,
varx <= 0x000000000fa0f275, 112,
varx <= 0x000000001016d5f3, 113,
varx <= 0x00000000109016e0, 114,
varx <= 0x00000000110cca0d, 115,
varx <= 0x00000000118d0871, 116,
varx <= 0x000000001210e6db, 117,
varx <= 0x000000001298826c, 118,
varx <= 0x000000001323f41d, 119,
varx <= 0x0000000013b3590f, 120,
varx <= 0x000000001446c611, 121,
varx <= 0x0000000014de5c6e, 122,
varx <= 0x00000000157a3947, 123,
varx <= 0x00000000161a7594, 124,
varx <= 0x0000000016bf32a0, 125,
varx <= 0x0000000017688d8d, 126,
127, &indx);
break;
case 2:
select_pri_64bit_32val( varx <= 0x0000000018c9a234, 128,
varx <= 0x0000000019819a5b, 129,
varx <= 0x000000001a3eb58d, 130,
varx <= 0x000000001b01193f, 131,
varx <= 0x000000001bc8e294, 132,
varx <= 0x000000001c963b27, 133,
varx <= 0x000000001d694444, 134,
varx <= 0x000000001e422789, 135,
varx <= 0x000000001f210a6a, 136,
varx <= 0x0000000020061683, 137,
varx <= 0x0000000020f17574, 138,
varx <= 0x0000000021e350d9, 139,
varx <= 0x0000000022dbd679, 140,
varx <= 0x0000000023db2ff2, 141,
varx <= 0x0000000024e18b0b, 142,
varx <= 0x0000000025ef158a, 143,
varx <= 0x000000002703fd37, 144,
varx <= 0x0000000028207402, 145,
varx <= 0x000000002944a7b1, 146,
varx <= 0x000000002a70ce5e, 147,
varx <= 0x000000002ba519fa, 148,
varx <= 0x000000002ce1c09e, 149,
varx <= 0x000000002e26f43b, 150,
varx <= 0x000000002f74ef13, 151,
varx <= 0x0000000030cbe73f, 152,
varx <= 0x00000000322c12da, 153,
varx <= 0x000000003395ac27, 154,
varx <= 0x000000003508f191, 155,
varx <= 0x0000000036861933, 156,
varx <= 0x00000000380d5d4f, 157,
varx <= 0x00000000399efc52, 158,
159, &indx);
break; 
case 3: 


select_pri_64bit_4val ( varx <= 0x000000003ce244bd, 160,
varx <= 0x000000003e9466d5, 161,
varx <= 0x0000000040000000, 162,
162, &indx);


break; 
} 
// ----- Shift by 8 bits ----- //
vara = a[indx];
varb = b[indx];
varx >>= 8; // Shift right 8 for mult 40.24
vara >>= 8; // Shift right 8
varb >>= 8; // Shift right 8
// ----- Square X and shift ----//
varsq = varx*varx; // Now we have 80.48 -> 16.48
varsq >>= 24; // SRL eliminate 40.24
varsq = varsq & 0x000000FFFFFFFFFF; // bitwise AND; 24bits
// -- X^2 * first Coefficient --//
ax2 = varsq*vara; // a[indx];
ax2 >>= 16; // Want 32.32, so srl 16
if (vara < 0x8000000000000000) // if both +ve
    ax2 = ax2 & 0x0000FFFFFFFFFFFF; // bitwise AND; 16bits
// --- X * second Coefficient --//
bx1 = varx*varb; // both are already shifted
bx1 >>= 16; // Return to 32.32 (int.fract)
if (varb < 0x8000000000000000) // if both +ve
    bx1 = bx1 & 0x0000FFFFFFFFFFFF; // bitwise AND; 16bits

// -- 3 input add to complete --//
y[i] = ax2+bx1+c[indx]; // Add all, no need to shift varc

// DEBUG: printf for debug information on variable status
printf ("indx: %3i, varx: %8llx vasq: %10llx a: %10llx ax2: %16llx b: %16llx bx1: %16llx c: %10llx fx: %16llx \n",
        (int)indx, varx, varsq, vara, ax2,
        varb, bx1, c[indx], y[i]);
}
// Time it took to compute
read_timer (&tm1);
*time0 = tm1-tm0;


/*********************************************
 * Send back the results
 *********************************************/
nbytes = xpts * 8;
DMA_CPU (OBM2CM, y, MAP_OBM_stripe(1,"F"), yc, 1, nbytes, 0);
wait_DMA (0);
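
The shift sequence in the fixed-point listing above can be checked on a workstation with ordinary C. The following is a minimal sketch, not the MAP code itself: the helper names (to_fixed, quad_fixed) are hypothetical, values are assumed to be 32.32 fixed point held in int64_t, x is assumed to lie in [0,1), and the masking steps of the listing are omitted. It also assumes the compiler implements >> on negative values as an arithmetic shift.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a double to 32.32 fixed point (hypothetical helper). */
static int64_t to_fixed(double v) {
    return (int64_t)(v * 4294967296.0);   /* scale by 2^32 */
}

/* Evaluate a*x^2 + b*x + c with the same shifts as the listing:
 * operands are reduced to 24 fractional bits before each multiply so
 * that the 48-fractional-bit products fit in 64 bits. */
static int64_t quad_fixed(int64_t x, int64_t a, int64_t b, int64_t c) {
    assert(x >= 0 && x < ((int64_t)1 << 32));  /* sketch assumes 0 <= x < 1 */
    int64_t xs = x >> 8;                /* 32.32 -> 40.24 */
    int64_t as = a >> 8;                /* 32.32 -> 40.24 */
    int64_t bs = b >> 8;                /* 32.32 -> 40.24 */
    int64_t xsq = (xs * xs) >> 24;      /* 48 fractional bits -> 24 */
    int64_t ax2 = (xsq * as) >> 16;     /* 48 fractional bits -> 32 */
    int64_t bx1 = (xs * bs) >> 16;      /* 48 fractional bits -> 32 */
    return ax2 + bx1 + c;               /* c is already 32.32 */
}
```

For example, with x = 0.5, a = 1, b = 2 and c = 3, every intermediate value is a power of two and the result equals to_fixed(4.25) exactly; in general the low 8 bits dropped from each operand limit the accuracy of the result.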


3. Fixed Point with Macro 


This implementation did not produce the correct values. The multiplier macro
used in this case was the VHDL macro shown in Appendix B.

The user can add macros to the Makefile that are coded in VHDL, Verilog, or
both description languages. Here we show two VHDL files added to the Makefile
and the blk.v and info files.


a. Makefile 











# $Id: Makefile,v 2.0.0.1 2005/06/10 23:12:59 hammes Exp $
#
# Copyright 2003 SRC Computers, Inc. All Rights Reserved.
#
# Manufactured in the United States of America.
#
# SRC Computers, Inc.
# 4240 N Nevada Avenue
# Colorado Springs, CO 80907
# (v) (719) 262-0213
# (f) (719) 262-0223
#
# No permission has been granted to distribute this software
# without the express permission of SRC Computers, Inc.
#
# This program is distributed WITHOUT ANY WARRANTY OF ANY KIND.
#
# User defines FILES, MAPFILES, and BIN here
#
FILES = main.c
MAPFILES = subr.mc
BIN = nfg
#
# Multi chip info provided here
# (Leave commented out if not used)
#
#PRIMARY = <primary file 1> <primary file 2>
#SECONDARY = <secondary file 1> <secondary file 2>
#CHIP2 = <file to compile to user chip 2>
#
# User defined directory of code routines
# that are to be inlined
#
INLINEDIR =
#
# User defined macros info supplied here
# (Leave commented out if not used)
#
MACROS = my_macro1/mult_vrlg_64.v
MY_BLKBOX = my_macro1/blk.v
MY_NGO_DIR = my_macro1
MY_INFO = my_macro1/info

MACROS = my_macro/mult_32to32.vhd \
         my_macro/add_32.vhd
MY_BLKBOX = my_macro/blk.v
MY_NGO_DIR = my_macro
MY_INFO = my_macro/info
#
# Floating point macros selection
#
#FPMODE = SRC_IEEE_V1 # Default SRC version IEEE
#FPMODE = SRC_IEEE_V2 # Size reduced SRC IEEE with
#                     # special rounding mode
#
# User supplied MCC and MFTN flags
#
MCCFLAGS = -log -explain_dep -g -keep -use_par
MFTNFLAGS = -log -v
#
# User supplied flags for C & Fortran compilers
#
CC = icc    # icc for Intel, cc for Gnu
FC = ifort  # ifort for Intel, f77 for Gnu
LD = icc    # for C codes
#LD = ifort # for Fortran or C/Fortran mixed
CFLAGS =
FFLAGS =
LDFLAGS =   # Flags to include libs if needed
#
# VCS simulation settings
# (Set as needed, otherwise just leave commented out)
#
#USEVCS = yes   # YES or yes to use vcs instead of vcsi
#VCSDUMP = yes  # YES or yes to generate vcd+ trace dump
#
# No modifications are required below
#
MAKIN ?= $(MC_ROOT)/opt/srcci/comp/lib/AppRules.make
include $(MAKIN)





b. subr.mc 








#include <libmap.h> 


void subr_map (int64_t ac[],
               int     ncoef,
               int64_t xc[],
               int64_t yc[],
               int     xpts,
               int64_t *time0,
               int     mapno) {





/*********************************************
 * Declarations
 *********************************************/
OBM_BANK_A (segend, int64_t, MAX_OBM_SIZE)
OBM_BANK_B (a, int64_t, MAX_OBM_SIZE)
OBM_BANK_C (b, int64_t, MAX_OBM_SIZE)
OBM_BANK_D (c, int64_t, MAX_OBM_SIZE)
OBM_BANK_E (x, int64_t, MAX_OBM_SIZE)
OBM_BANK_F (y, int64_t, MAX_OBM_SIZE)

int i,j, nbytes;
int64_t tm0, tm1, indx;
int varx,vara,varb,varc,prod3,prod2,prod1,fx;
int xg,ag,bg,cg;


/*********************************************
 * Read into OBM. Coeff & segment endpoints
 *********************************************/
// 4 data values (seg,a,b,c), 64bit Hex values
nbytes = 4*ncoef * 8;
DMA_CPU (CM2OBM, segend, MAP_OBM_stripe(1,"A,B,C,D"), ac, 1,
         nbytes, 0);
wait_DMA (0);

// Read in the Number of points
nbytes = xpts * 8;
DMA_CPU (CM2OBM, x, MAP_OBM_stripe(1,"E"), xc, 1, nbytes, 0);
wait_DMA (0);








// DEBUG: Tell me I'm in the MAP
printf ("\n\n********** NOW IN MAP **********\n");
printf ("MAP subr ncoef %i xpts %i \n",ncoef,xpts);

/*********************************************
 * Read timer and use selector to determine the segment
 *********************************************/
read_timer (&tm0);
for (i=0;i<xpts;i++)
{


split_64to32 (x[i], &xg, &varx);

// SEGMENT INDEX ENCODER
// Based on x input, determine which index to select
// the coefficients for approximation
select_pri_32bit_16val( varx<= 0x12de, 0,
                        varx<= 0x2087, 1,
                        varx<= 0x2c8c, 2,
                        varx<= 0x37a9, 3,
                        varx<= 0x422b, 4,
                        varx<= 0x4c45, 5,
                        varx<= 0x5613, 6,
                        varx<= 0x5faa, 7,
                        varx<= 0x6916, 8,
                        varx<= 0x7268, 9,
                        varx<= 0x7bac, 10,
                        varx<= 0x7fff, 11,
                        varx<= 0x7fff, 11,
                        varx<= 0x7fff, 11,
                        varx<= 0x7fff, 11,
                        11, &indx);
indx = i%12;

split_64to32 (a[indx], &ag, &vara);
split_64to32 (b[indx], &bg, &varb);
split_64to32 (c[indx], &cg, &varc);





// use macro multiplier
my_mult (varx, varx, &prod1); // prod1 = x^2 term

// Perform together
my_mult (prod1, vara, &prod2); // prod2 = ax^2 term
my_mult (varx, varb, &prod3); // prod3 = bx term

// Perform final add stage
//my_add(prod2,prod3,varc,&fx); // 3 input macro adder
fx = prod2+prod3+varc;

// Put result in OBM
y[i] = fx & 0x00000000FFFFFFFF;





// DEBUG: printf for debug information on variable status
//printf ("indx: %3i a[]: %llx varb: %x c: %x x: %x fx: %lx, y[]: %llx\n",
//        indx, a[indx],varb, varc,varx, fx,y[i]);
// printf ("indx: %3i a: %x b: %x c: %x x: %x fx: %lx, y[]: %llx\n",
//         indx, vara, varb, varc, varx,fx,y[i]);
// printf ("prod1: %x prod2: %x prod3: %x \n",
//         prod1, prod2, prod3);

} // End for (i=0;i<xpts;i++)

read_timer (&tm1);
*time0 = tm1-tm0;


/*********************************************
 * Send back the results
 *********************************************/
nbytes = xpts * 8;
DMA_CPU (OBM2CM, y, MAP_OBM_stripe(1,"F"), yc, 1, nbytes, 0);
wait_DMA (0);











c. blk.v
module mult_32to32(a, b, clk, prod) /* synthesis syn_black_box */ ;
input [31:0] a;
input [31:0] b;
output [31:0] prod;
input clk;
endmodule

module add_32(a, b, c, sum) /* synthesis adderparthere */ ;
input [31:0] a;
input [31:0] b;
input [31:0] c;
output [31:0] sum;
endmodule
d. info
BEGIN_DEF "my_mult"
MACRO = "mult_32to32";
STATEFUL = NO;
EXTERNAL = NO;
PIPELINED = YES;
LATENCY = 7;
INPUTS = 2:
I0 = INT 32 BITS (a) // explicit input
I1 = INT 32 BITS (b) // explicit input
;
OUTPUTS = 1:
O0 = INT 32 BITS (prod) // explicit output
;
IN_SIGNAL 1 BITS "clk" = "CLOCK";
DEBUG_HEADER = #
void my_mult__dbg (int a, int b, int *prod);
#;
DEBUG_FUNC = #
void my_mult__dbg (int a, int b, int *prod) {
*prod = a*b;
*prod >>= 32;
}
#;
END_DEF
BEGIN_DEF "my_add"
MACRO = "add_32";
STATEFUL = NO;
EXTERNAL = NO;
PIPELINED = NO;
LATENCY = 1;
INPUTS = 3:
I0 = INT 32 BITS (a) // explicit input
I1 = INT 32 BITS (b) // explicit input
I2 = INT 32 BITS (c) // explicit input
;
OUTPUTS = 1:
O0 = INT 32 BITS (sum) // explicit output
;
DEBUG_HEADER = #
void my_add__dbg (int a, int b, int c, int *sum);
#;
DEBUG_FUNC = #
void my_add__dbg (int a, int b, int c, int *sum) {
*sum = a+b+c;
}
#;
END_DEF
APPENDIX D. COPY OF PROFILE REPORT


The profile report shows the execution time for non-uniform segmentation with
the following parameters: sqrt(-ln(x)), ε = 2 and N = 1,000,000. Profile reports
are used to debug functions, optimize files and understand the dynamics and choke
points in the program. Parent functions and child functions can be analyzed to
find the slow points in the program.


The longest times in the report, 62.906 s and 50.703 s, belong to xlabel and
ylabel, respectively. They were used to display graphs for debugging purposes.
Any function used to drive graphics is slow compared to computation. In a final
version, the display is not required, so these times disappear and have no
impact.


The next longest times are 29.063 s and 26.359 s, which correspond to
multipleQuadApprox and varQuadApproxHybThirdNew, respectively. Note, however,
that these are total times: multipleQuadApprox is a parent function of
varQuadApproxHybThirdNew. The Self Time column indicates the amount of time that
a function actually spends in itself; the remaining time is spent in its child
functions. The child function of varQuadApproxHybThirdNew is chebyRemez, which
makes chebyRemez the longest-running part of the code. The child functions of
chebyRemez take up a lot of time, but chebyRemez is the most suitable metric for
comparing the speed of the different functions.


Profile Summary
Generated 28-Jul-2007 08:59:56

[Table: Profile Summary, listing function name, calls, total time and self
time. Only partially legible. Legible entries include ylabel (5244 calls,
50.703 s total) and a group of specgraph.baseline and specgraph.stemseries
helper functions (refresh, setLegendInfo and the schema update actions),
several of which are listed at 0.000 s.]

Self time is the time spent in a function excluding the time spent in its child functions.
Self time also includes overhead resulting from the process of profiling.


APPENDIX E. LESSONS LEARNED 


This section provides information and a record of problems that were encountered 
while using the SRC-6, and other software applications in this thesis. The intent is to 
provide a reference to specific issues previously encountered and to reduce the amount of 


time to resolve or understand them in the future. 


E.1 FILE NAMING PROBLEMS


Problem: When you compile your VHDL code using Xilinx's ISE Navigator, it treats
upper and lower case versions of letters as the same. That is, adderVerilog.vhd and
adderverilog.vhd are the same file to Xilinx's ISE Navigator. However, file names on
the SRC-6 are case sensitive. That is, adderVerilog.vhd and adderverilog.vhd are
DIFFERENT files on the SRC-6. So, if you have listed adderverilog.vhd in your
Makefile as a macro, it will not recognize adderVerilog.vhd as the target file.
Additionally, if you let Xilinx create VHDL code from a schematic that contains the
module adderVerilog.vhd, it will refer to the module in the VHDL code as
adderverilog.vhd.


Solution: Use lower case letters for ALL files. 


Author: J.T. Butler 
Date: 26 FEB 07 


E.2 USING THE CONST CONSTRUCT IN C


Problem: A martello64 error is obtained when using


int64_t array[5][5] = { {1,2,3,4,5},
                        {6,7,8,9,10},
                        {11,12,13,14,15},
                        {16,17,18,19,20},
                        {21,22,23,24,25} };


The error is caused by "too many accesses to BRAM".


Background: This is a correct C construct when used on a PC or workstation.
However, when it is in a .mc file, this declaration will cause a martello64 error. It is
possibly due to too many accesses to a BRAM (arrays are usually stored in BRAM).


This was a problem that Scott Bailey experienced. The initial writeup is based on a
conversation between Scott Bailey and Jon Butler on December 1, 2006.
Solution: In discussing this with Dave Caliga, Scott learned that the Carte™ 2.2 version
should correct this error. At the time the error occurred, we were using Carte™ 2.1.
Apparently, Carte™ 2.2 spaces out the accesses to BRAM so that the array can
include ALL 25 data values. However, in order to use it in Carte™ 2.2, you need to
declare the array as a constant, like so


const int64_t array[5][5] = { {1,2,3,4,5},
                              {6,7,8,9,10},
                              {11,12,13,14,15},
                              {16,17,18,19,20},
                              {21,22,23,24,25} };


The intent of const is to set up a constant array that is not changed in the rest of the
program, much like a ROM instead of RAM.


Scott Bailey tried to work around this error by simply defining the array without
populating it with initial values, using, for example: int64_t array[5][5]; The
compiler accepted this. He then put the desired values into array using for loops.
These arrays will then work as normal C arrays within the .mc code. However, this
decreases performance, since the values placed into the array must come from either
OBM or streams, access of which will incur a time penalty. Scott believes that the
problem is in putting too many values into BRAM too quickly. In a dialog with Dave
Caliga (SRC Computers), Dave said that the problem occurs when there are more than 8
initialized values placed in the array. Scott believes that this problem will occur in
BOTH Carte™ 2.1 and 2.2 for non-constant BRAM arrays.
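
The two alternatives described above can be compared in ordinary C. This is a minimal workstation sketch, not .mc code: the names table_const and fill_table are hypothetical, and the fill loop computes its values locally, whereas on the MAP they would have to come from OBM or streams.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROWS 5
#define COLS 5

/* Alternative 1: a constant table, read-only for the rest of the program
 * (the ROM-like form that Carte 2.2 is reported to accept). */
static const int64_t table_const[ROWS][COLS] = {
    { 1,  2,  3,  4,  5},
    { 6,  7,  8,  9, 10},
    {11, 12, 13, 14, 15},
    {16, 17, 18, 19, 20},
    {21, 22, 23, 24, 25}
};

/* Alternative 2: declare the array without initial values and populate it
 * with for loops. Here each element is simply its position in row-major
 * order plus one, which reproduces the table above. */
static void fill_table(int64_t t[ROWS][COLS]) {
    assert(t != NULL);
    for (int r = 0; r < ROWS; r++) {
        for (int c = 0; c < COLS; c++) {
            t[r][c] = (int64_t)(r * COLS + c + 1);
        }
    }
}
```

Both forms hold the same 25 values; the difference reported above lies in how the MAP compiler schedules the BRAM accesses, which this workstation sketch cannot show.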


Author: J.T. Butler 
Date: 26 FEB 07 


E.3 INCORRECT ARGUMENTS IN SYSTEM SUPPLIED MACROS


Problem: A core dump occurs when the call-by-value and call-by-reference conventions
are not adhered to, as in


popcount_64(int64_t a, int array[i])

Instead of an error message, there will be a core dump.


Background: This was provided by Scott Bailey in a conversation with Jon Butler on
December 1, 2006.


Solution: To solve this problem, use the following code.


popcount_64(int64_t a, &temp)
array[i] = temp;

For most system macros, SRC requires that the input values be passed as call-by-value
(e.g. a) and all output values be passed as call-by-reference (e.g. &temp).
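
The calling pattern can be tried in ordinary C. In this minimal sketch, popcount_64_standin is a hypothetical stand-in written for illustration; it is not the SRC system macro, and only the argument convention (input by value, output through a pointer to a scalar temporary) is taken from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a system macro: the input is passed by value,
 * the output is returned through a pointer. */
static void popcount_64_standin(int64_t a, int *count) {
    uint64_t v = (uint64_t)a;
    int n = 0;
    while (v != 0) {
        n += (int)(v & 1u);   /* add the lowest bit */
        v >>= 1;
    }
    *count = n;
}

/* Count the set bits of each input word, following the recommended
 * pattern: write into a scalar temporary, then copy into the array. */
static void count_all(const int64_t *in, int *out, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        int temp;
        popcount_64_standin(in[i], &temp);
        out[i] = temp;
    }
}
```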

Author: J.T. Butler 

Date: 26 FEB 07 


E.4 IF/THEN/ELSE LIMITATION

Problem: When programming in C within the .mc file (no macro), an error occurs when
the if-then-else chain is too long (about 26 in length).


Background: This was discovered by Prof. Jon Butler when trying to implement a long
if-then-else chain during testing.


Solution: SRC Carte™ V2.2 fixes this problem. 


Author: T.J. Mack 
Date: 26 FEB 07 


E.5 MULTIPLE FILES USED IN A MACRO 


Problem: When using multiple files to describe a circuit in a macro, the SRC won’t 
successfully compile. 


Background: This was discovered while developing the NFG macro where different 
modules are described in separate VHDL files. 


Solution: List all of the VHDL files within the Makefile under macros, separated by a 
space. 


Author: T.J. Mack 
Date: 26 FEB 07 


E.6 XILINX/SYNPLIFY INCONSISTENCIES


Problem: VHDL code synthesizes correctly (no errors) in Xilinx XST, but does not in 
Synplify PRO. 


Background: When developing VHDL code for the NFG, the code was originally 
written in the Xilinx ISE. Checking for errors using Xilinx XST resulted in no errors. 
When the code was transported to the SRC, errors resulted. Further troubleshooting 
produced the same errors when using the stand-alone Synplify. 


Solution: Not all code is universal. Always test code using a stand-alone version of 
Synplify. If it results in errors, the code must be modified. 

Author: T.J. Mack 

Date: 26 FEB 07 
E.7 MODELSIM AND MULTIPLE HDL’S


Problem: ModelSim XE (Xilinx Edition), which is available for free from the Xilinx
website, does not support multiple HDL’s.


Background: When developing the NFG, some code was provided by SRC in Verilog. 
When attempting to analyze the circuit with a test bench, an error occurred in ModelSim. 
The error stated that ModelSim XE does not support multiple HDL’s. 


Solution: Download ModelSim SE. NPS has a license. Details available from Dan 
Zulaica. 

Author: T.J. Mack 

Date: 26 FEB 07 


E.8 INITIALIZING MEMORY FROM A SEPARATE FILE 


Problem: Xilinx allows one to synthesize a ROM where the ROM contents are specified
in a separate file. When transferring the VHDL files to the SRC and synthesizing with
Synplify, an error results. This is another artifact of problem E.6 above.


Background: Because of the potentially large amount of data needed to load into a 
ROM, it is useful to have a separate file with just this data. The HDL must then access 
this data file during synthesis. 


Solution: Problem not completely solved, yet. Some potential solutions are: 


1. Below is a ROM provided by SRC Computers. Written in Verilog (SRC
Computers' preferred language), it consists of 32 LUTs, each with 4 inputs and a
1-bit output, giving a 32-bit output. It is initialized using a separate .sdc file.


module MY_ROM (
  data,
  adr
);
output [31:0] data;
input [3:0] adr;


ROM16X1 M0 (
  .O (data[0]),
  .A0 (adr[0]),
  .A1 (adr[1]),
  .A2 (adr[2]),
  .A3 (adr[3])
);

ROM16X1 M1 (
  .O (data[1]),
  .A0 (adr[0]),
  .A1 (adr[1]),
  .A2 (adr[2]),
  .A3 (adr[3])
);

*** Fill-In Remaining Modules ***

ROM16X1 M31 (
  .O (data[31]),
  .A0 (adr[0]),
  .A1 (adr[1]),
  .A2 (adr[2]),
  .A3 (adr[3])
);

endmodule











The ROM initialization values are in the .sdc file below. The INITs are
somewhat cumbersome, since the LUTs are 1-bit wide. So each of the LUTs has one bit
position for all of the 16 values. The INIT values essentially represent a 32 row by 16
column matrix. Each column represents one of 16, 32-bit outputs.
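
The row and column relationship described above amounts to a bit-matrix transpose, which can be scripted rather than done by hand. The following is a minimal sketch in C; the function names are hypothetical, and it assumes that bit k of a LUT's 16-bit INIT value is the output of that LUT for address k, and that LUT j drives data[j].

```c
#include <assert.h>
#include <stdint.h>

#define ROM_DEPTH 16   /* number of 32-bit words (4-bit address) */
#define ROM_WIDTH 32   /* number of 1-bit LUTs, one per output bit */

/* Transpose 16 words of 32 bits into 32 INIT values of 16 bits:
 * bit k of init[j] is bit j of words[k]. */
static void rom_words_to_inits(const uint32_t words[ROM_DEPTH],
                               uint16_t init[ROM_WIDTH]) {
    for (int j = 0; j < ROM_WIDTH; j++) {
        uint16_t v = 0;
        for (int k = 0; k < ROM_DEPTH; k++) {
            if ((words[k] >> j) & 1u) {
                v |= (uint16_t)(1u << k);
            }
        }
        init[j] = v;
    }
}

/* Model a read of the LUT-based ROM: collect bit adr of every INIT. */
static uint32_t rom_read(const uint16_t init[ROM_WIDTH], unsigned adr) {
    assert(adr < ROM_DEPTH);
    uint32_t w = 0;
    for (int j = 0; j < ROM_WIDTH; j++) {
        if ((init[j] >> adr) & 1u) {
            w |= (uint32_t)1u << j;
        }
    }
    return w;
}
```

Each init[j] would then be printed as four hexadecimal digits for the INIT attribute of module Mj; reading the model back with rom_read gives a quick check that the transposed values reproduce the intended words.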














define_attribute {i:M0} xc_props "INIT=ba5d"
define_attribute {i:M1} xc_props "INIT=8801"

*** Fill-In Missing Values ***

define_attribute {i:M31} xc_props "INIT=1321"


This is the most promising example of a ROM with an external file for
initialization. However, the 1-bit format of the init values makes it difficult to
implement.


2. Below is another ROM example provided by SRC Computers. It uses the
RAMB16_S18_S18 module, which is a 16 Kb Block RAM with two 18-bit outputs (16
bits plus 2 bits for parity). It is initialized using the xc_props lines within the
same file.


module MY_ROM (
  din_0,
  dout_0,
  din_1,
  dout_1,
  adr_0,
  adr_1,
  w_en_0,
  w_en_1,
  clk
);

input [15:0] din_0;
output [15:0] dout_0;
input [15:0] din_1;
output [15:0] dout_1;
input [9:0] adr_0;
input [9:0] adr_1;
input w_en_0;
input w_en_1;
input clk /* synthesis syn_noclockbuf=1 */ ;

RAMB16_S18_S18 M0 (
  .DOA (dout_0[15:0]),
  .DOB (dout_1[15:0]),
  .DOPA (), // ignore the parity outputs
  .DOPB (), // ignore the parity outputs
  .ADDRA (adr_0),
  .ADDRB (adr_1),
  .CLKA (clk),
  .CLKB (clk),
  .DIA (din_0[15:0]),
  .DIB (din_1[15:0]),
  .DIPA (2'b0), // zero the parity inputs
  .DIPB (2'b0), // zero the parity inputs
  .ENA (1'b1),
  .ENB (1'b1),
  .SSRA (1'b0),
  .SSRB (1'b0),
  .WEA (w_en_0),
  .WEB (w_en_1)
) /* synthesis





xc_props="INIT_00=76931fac9dab2b36c248b87d6ae33£9a62d7183a5d5789e4b2d6b441e2411dc7, \ 
INIT_01=09el1lllc7ele7acb6f8cac0bb2fc4c8bc2ae3baaab9165cc458e199ch89F51b13, \ 
INIT_02=5£7091a5abb0874df£3e8cb4543a5eb93b0441e9ca4c2b0 fb3d30875cb£29abd5, \ 
INIT_3e=la0bf 9b00ffd21b6210b11dc59ec947be86d1llel0de2e980b8bc988e26aba269, \ 


*** Fill-In Missing Values ***


INIT_3f=ac6bd4cd2bf0471f£cb95377922449de5393850a00a57b47800d374d961ldfeb5" */ ; 


endmodule 


3. The following code is a 16 x 32-bit ROM written in Verilog. It will
synthesize in Xilinx XST, but not in Synplify PRO.


module romverlog(input [3:0] raddr, output [31:0] slope_int);

reg [15:0] mem [31:0];

initial
begin
  $readmemb("memory.mem", mem);
end
assign slope_int = mem[raddr];
endmodule


The associated memory.mem file is a simple, binary text file with the memory 
initialization values. 


00000110010001000000000000000000 
00000110001011010000000000000000 
00000101111111110000000000000100 
00000101101110100000000000001100 
00000101011000000000000000011010 
00000100111100010000000000101111 
00000100011100000000000001001101 
00000011110111110000000001110100 
00000011001111110000000010100101
00000010100100110000000011100001
00000001110111100000000100100111
00000001001000010000000101110111
00000000011000000000000111001111
00000001110111100000000100100111
00000001001000010000000101110111
00000000011000000000000111001111

















Author: T.J. Mack 
Date: 26 FEB 07 


E.9 MACRO LATENCY AND SRC OVERHEAD 


Problem: When implementing a macro, SRC requires additional clock cycles for 
overhead operations. The overhead appears to be 5 clock cycles to pass data to a macro 
and an additional 5 clock cycles to receive data from a macro. One would therefore 
expect a macro with a latency of 3 to take a total of 13 clock cycles. However, it takes 
only 12, because the last clock cycle of the macro is absorbed into the 5 clock cycles 
needed to receive data from it. In this case, the latency in the info file must be set to 2, 
even though the schematic may show a latency of 3. 


Background: When developing the NFG, the pipeline depth reported for the loop that 
calls the NFG macro was always 10 clock cycles more than the latency of the macro itself. 


Solution: No solution. This is a characteristic of the SRC architecture. 


Author: T.J. Mack 
Date: 26 FEB 07 
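
The arithmetic in E.9 can be sketched as follows. This is an illustration only: the two overhead constants are the values observed above, and the rule that the info-file latency is always one less than the schematic latency is an assumption generalized from the single case reported (schematic latency 3, info-file latency 2). The function names are hypothetical.

```python
SRC_INPUT_OVERHEAD = 5   # clocks to pass data to a macro (observed value)
SRC_OUTPUT_OVERHEAD = 5  # clocks to receive data from a macro (observed value)


def info_file_latency(schematic_latency):
    """Latency to enter in the info file.

    Assumption: the last clock of the macro overlaps the receive overhead,
    so the info-file value is one less than the schematic latency.
    """
    if schematic_latency < 1:
        raise ValueError("schematic latency must be at least 1")
    return schematic_latency - 1


def total_clocks(schematic_latency):
    """Total clocks for one macro call, including SRC overhead."""
    return (SRC_INPUT_OVERHEAD
            + info_file_latency(schematic_latency)
            + SRC_OUTPUT_OVERHEAD)
```

With a schematic latency of 3, `info_file_latency(3)` is 2 and `total_clocks(3)` is 12, matching the figures reported above.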


E.10 CANNOT USE PRIORITY SELECTOR GREATER THAN 128 


Problem: When implementing a priority selector with 256 elements, each 64 bits wide, I 
could not compile the .mc file. This is because the architecture already had three 
64-bit-wide multipliers and other hardware that consumed some of the resources. If not 
all 256 elements are needed, it would be useful to have a selector size that is greater 
than 128 and smaller than 256. 


Background: When implementing a priority selector with 150 elements, the only 
option for a single selector is the 256-element selector, which has 106 more elements 
than required. 

Solution: Use multiple selectors of smaller sizes. 

Author: N. Macaria 
Date: 26 JUL 07 
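
The idea of covering 150 elements with several smaller selectors can be modeled in software. The sketch below is a behavioral model in Python, not SRC macro code. It assumes that a priority selector picks the first element whose condition is true, and the selector sizes (128 and 32) are hypothetical, chosen only so that together they cover 150 inputs.

```python
def first_true(flags):
    """Model of one priority selector: index of the first true flag, or -1."""
    for i, flag in enumerate(flags):
        if flag:
            return i
    return -1


def chained_first_true(flags, sizes=(128, 32)):
    """Model of several smaller selectors checked in priority order.

    Each selector handles a contiguous slice of the inputs. The first
    selector that reports a hit determines the overall result.
    """
    if sum(sizes) < len(flags):
        raise ValueError("selectors do not cover all inputs")
    offset = 0
    for size in sizes:
        hit = first_true(flags[offset:offset + size])
        if hit >= 0:
            return offset + hit  # convert the local index to a global index
        offset += size
    return -1
```

The chained version returns the same index as a single wide selector would, so the split does not change which element is selected.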
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E.11 IF-THEN-ELSE STATEMENT WITH SRC PRIORITY SELECTORS 


Problem: When implementing multiple priority selectors in the .mc file, SRC would not 
accept an if-then-else statement with priority selectors in its body. 


Background: The program would not compile if a priority selector was used inside an 
if-then-else statement. 


Solution: Put the if-then-else statement prior to the priority selector, use a variable to 
store the selector you want to pick, then use a case statement to reach that selector. 


Author: N. Macaria 
Date: 26 JUL 07 


E.12 FIND THE SLOW CODE IN MATLAB PROGRAMS 


Problem: When running MATLAB programs, the code sometimes takes a very long time 
to execute, and it may not be clear where the problem lies. 


Background: When running the chebyRemz program, portions of the code took a very 
long time to run. 




Author: N. Macaria 
Date: 26 JUL 07 
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APPENDIX F. SEGMENT ESTIMATION EQUATION 


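The segment estimation equation is derived by analyzing the Chebyshev 
approximation error. Equation (0.6) gives the error \varepsilon in the general case 
for a segment [a, b]: 


\[ \varepsilon = \frac{2\,(b-a)^{d+1}}{4^{d+1}\,(d+1)!} \max_{a \le x \le b} \left| f^{(d+1)}(x) \right| \qquad (0.6) \]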


The variable d is the order of the approximation to be used. For the case of quadratic 
approximation, d=2 and (b-a) is the estimated width of the segment. 
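
As an illustration, the error bound can be inverted to estimate a segment width for a target error. The sketch below assumes the standard form of the Chebyshev error bound, eps = 2 (b-a)^(d+1) / (4^(d+1) (d+1)!) * max|f^(d+1)(x)|, and treats the maximum of the derivative over the segment as a known input. The function names are hypothetical, and this is not the thesis's segmentation algorithm.

```python
from math import factorial


def chebyshev_error(width, max_deriv, d=2):
    """Error bound for a degree-d approximation on a segment of this width."""
    return (2.0 * width ** (d + 1)
            / (4.0 ** (d + 1) * factorial(d + 1))
            * max_deriv)


def estimated_width(eps, max_deriv, d=2):
    """Invert the bound: the width (b - a) whose error bound equals eps."""
    if eps <= 0 or max_deriv <= 0:
        raise ValueError("eps and max_deriv must be positive")
    return 4.0 * (factorial(d + 1) * eps / (2.0 * max_deriv)) ** (1.0 / (d + 1))
```

For the quadratic case (d = 2) this reduces to a width of (192 eps / M)^(1/3), where M is the maximum magnitude of the third derivative on the segment, so a smaller third derivative allows a wider segment.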
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